Loan Prediction¶

Instructions¶

Step 1: Clean and prepare your data: There are several entries where values have been deleted to simulate dirty data. Please clean the data with whatever method(s) you believe is best/most suitable. Note that some of the missing values are truly blank (unknown answers) and thus may be impossible to clean; use your discretion.

Step 2: Build your models: Please build machine learning/statistical models in Python to predict the interest rate assigned to a loan. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.

Step 3: Test your models using the data found within the "Holdout for Testing" file. Save the results of the final model (remember you will only predict the first column in holdout test set with your best model results) in a single, separate CSV titled "Results from" *insert your name or UChicago net ID.

Step 4: Submit your work: Please submit all of your code for cleaning, prepping, and modeling your data, your "Results" file, a brief write-up comparing the pros and cons of the modeling techniques you used (no more than a paragraph). Your work will be scored on techniques used (appropriateness and complexity), model performance - measured by RMSE - on the data hold out, an understanding of the techniques you compared in your write-up, and your overall code.

Meta Data¶

  1. X1: Interest Rate on the loan (y, target variable)
  2. X2: A unique id for the loan
  3. X3: A unique id assigned for the borrower
  4. X4: Loan amount requested
  5. X5: Loan amount funded
  6. X6: Investor-funded portion of loan
  7. X7: Number of payments (36 or 60)
  8. X8: Loan grade
  9. X9: Loan subgrade
  10. X10: Employer or job title (self-filled)
  11. X11: Number of years employed (0 to 10; 10 = 10 or more)
  12. X12: Home ownership status: RENT, OWN, MORTGAGE, OTHER
  13. X13: Annual income of borrower
  14. X14: Income verified, not verified, or income source was verified
  15. X15: Date loan was issued (Format: Mon-YY)
  16. X16: Reason for loan provided by borrower
  17. X17: Loan category, as provided by borrower
  18. X18: Loan title, as provided by borrower
  19. X19: First 3 numbers of zip code
  20. X20: State of borrower
  21. X21: A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested loan, divided by the borrower's self-reported monthly income
  22. X22: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
  23. X23: Date the borrower's earliest reported credit line was opened (Format: Mon-YY)
  24. X24: Number of inquiries by creditors during the past 6 months
  25. X25: Number of months since the borrower's last delinquency
  26. X26: Number of months since the last public record
  27. X27: Number of open credit lines in the borrower's credit file
  28. X28: Number of derogatory public records
  29. X29: Total credit revolving balance
  30. X30: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit
  31. X31: The total number of credit lines currently in the borrower's credit file
  32. X32: The initial listing status of the loan. Possible values are - W, F!

1. Import packages and read file¶

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import plotly.express as px
from sklearn import metrics

# Importing Regressor Modules
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor


from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE


pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
In [2]:
loan_data = pd.read_csv('Data for Cleaning & Modeling.csv')
loan_data.head()
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/1970426666.py:1: DtypeWarning: Columns (15) have mixed types. Specify dtype option on import or set low_memory=False.
  loan_data = pd.read_csv('Data for Cleaning & Modeling.csv')
Out[2]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32
0 11.89% 54734.0 80364.0 $25,000 $25,000 $19,080 36 months B B4 NaN < 1 year RENT 85000.0 VERIFIED - income Aug-09 Due to a lack of personal finance education an... debt_consolidation Debt consolidation for on-time payer 941xx CA 19.48 0.0 Feb-94 0.0 NaN NaN 10.0 0.0 28854.0 52.10% 42.0 f
1 10.71% 55742.0 114426.0 $7,000 $7,000 $673 36 months B B5 CNN < 1 year RENT 65000.0 not verified May-08 Just want to pay off the last bit of credit ca... credit_card Credit Card payoff 112xx NY 14.29 0.0 Oct-00 0.0 NaN NaN 7.0 0.0 33623.0 76.70% 7.0 f
2 16.99% 57167.0 137225.0 $25,000 $25,000 $24,725 36 months D D3 Web Programmer 1 year RENT 70000.0 VERIFIED - income Aug-14 Trying to pay a friend back for apartment brok... debt_consolidation mlue 100xx NY 10.50 0.0 Jun-00 0.0 41.0 NaN 10.0 0.0 19878.0 66.30% 17.0 f
3 13.11% 57245.0 138150.0 $1,200 $1,200 $1,200 36 months C C2 city of beaumont texas 10+ years OWN 54000.0 not verified Mar-10 If funded, I would use this loan consolidate t... debt_consolidation zxcvb 777xx TX 5.47 0.0 Jan-85 0.0 64.0 NaN 5.0 0.0 2584.0 40.40% 31.0 f
4 13.57% 57416.0 139635.0 $10,800 $10,800 $10,692 36 months C C3 State Farm Insurance 6 years RENT 32000.0 not verified Nov-09 I currently have a personal loan with Citifina... debt_consolidation Nicolechr1978 067xx CT 11.63 0.0 Dec-96 1.0 58.0 NaN 14.0 0.0 3511.0 25.60% 40.0 f

2. Data Cleaning & Preprocessing¶

  • Rename the column using meta data
In [3]:
loan_data.columns = ['interest_rate',
                    'loan_id',
                    'borrower_id',
                    'loan_requested',
                    'loan_funded',
                    'investor_funded',
                    'payment_numbers',
                    'loan_grade',
                    'loan_subgrade',
                    'job_title',
                    'years_employed',
                    'home_ownership',
                    'annual_income',
                    'income_verification',
                    'issued_date',
                    'loan_reason',
                    'loan_category',
                    'loan_title',
                    'zip_code',
                    'state',
                    'debt_payment_ratio',
                    'delinquency_frequency',
                    'credit_earliest_date',
                    'creditor_inquiries',
                    'borrower_delinquency_recency',
                    'public_record_recency',
                    'open_credit_line_count',
                    'derogatory_public_record_count',
                    'total_credit',
                    'credit_utilization_rate',
                    'total_credit_line_count',
                    'initial_status'
                    ]
In [4]:
loan_data.head()
Out[4]:
interest_rate loan_id borrower_id loan_requested loan_funded investor_funded payment_numbers loan_grade loan_subgrade job_title years_employed home_ownership annual_income income_verification issued_date loan_reason loan_category loan_title zip_code state debt_payment_ratio delinquency_frequency credit_earliest_date creditor_inquiries borrower_delinquency_recency public_record_recency open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status
0 11.89% 54734.0 80364.0 $25,000 $25,000 $19,080 36 months B B4 NaN < 1 year RENT 85000.0 VERIFIED - income Aug-09 Due to a lack of personal finance education an... debt_consolidation Debt consolidation for on-time payer 941xx CA 19.48 0.0 Feb-94 0.0 NaN NaN 10.0 0.0 28854.0 52.10% 42.0 f
1 10.71% 55742.0 114426.0 $7,000 $7,000 $673 36 months B B5 CNN < 1 year RENT 65000.0 not verified May-08 Just want to pay off the last bit of credit ca... credit_card Credit Card payoff 112xx NY 14.29 0.0 Oct-00 0.0 NaN NaN 7.0 0.0 33623.0 76.70% 7.0 f
2 16.99% 57167.0 137225.0 $25,000 $25,000 $24,725 36 months D D3 Web Programmer 1 year RENT 70000.0 VERIFIED - income Aug-14 Trying to pay a friend back for apartment brok... debt_consolidation mlue 100xx NY 10.50 0.0 Jun-00 0.0 41.0 NaN 10.0 0.0 19878.0 66.30% 17.0 f
3 13.11% 57245.0 138150.0 $1,200 $1,200 $1,200 36 months C C2 city of beaumont texas 10+ years OWN 54000.0 not verified Mar-10 If funded, I would use this loan consolidate t... debt_consolidation zxcvb 777xx TX 5.47 0.0 Jan-85 0.0 64.0 NaN 5.0 0.0 2584.0 40.40% 31.0 f
4 13.57% 57416.0 139635.0 $10,800 $10,800 $10,692 36 months C C3 State Farm Insurance 6 years RENT 32000.0 not verified Nov-09 I currently have a personal loan with Citifina... debt_consolidation Nicolechr1978 067xx CT 11.63 0.0 Dec-96 1.0 58.0 NaN 14.0 0.0 3511.0 25.60% 40.0 f
In [5]:
loan_data.shape
Out[5]:
(400000, 32)
In [6]:
na_value = loan_data.isna().sum()
na_value
Out[6]:
interest_rate                      61010
loan_id                                1
borrower_id                            1
loan_requested                         1
loan_funded                            1
investor_funded                        1
payment_numbers                        1
loan_grade                         61270
loan_subgrade                      61270
job_title                          23986
years_employed                     17538
home_ownership                     61361
annual_income                      61028
income_verification                    1
issued_date                            1
loan_reason                       276440
loan_category                          1
loan_title                            19
zip_code                               1
state                                  1
debt_payment_ratio                     1
delinquency_frequency                  1
credit_earliest_date                   1
creditor_inquiries                     1
borrower_delinquency_recency      218802
public_record_recency             348845
open_credit_line_count                 1
derogatory_public_record_count         1
total_credit                           1
credit_utilization_rate              267
total_credit_line_count                1
initial_status                         1
dtype: int64
  • calculate na % of each column to determine if we can drop it directly. Theoretically, 25% to 30% is the maximum missing values are allowed, beyond which we might want to drop the variable unless that variable is significant to the target variable.
In [7]:
na_df = pd.DataFrame(na_value)
na_df['NA_count'] = na_df
na_df['NA%'] = (na_df['NA_count'] / loan_data.shape[0])*100
na_df = na_df.iloc[:,1:3]
na_df
Out[7]:
NA_count NA%
interest_rate 61010 15.25250
loan_id 1 0.00025
borrower_id 1 0.00025
loan_requested 1 0.00025
loan_funded 1 0.00025
investor_funded 1 0.00025
payment_numbers 1 0.00025
loan_grade 61270 15.31750
loan_subgrade 61270 15.31750
job_title 23986 5.99650
years_employed 17538 4.38450
home_ownership 61361 15.34025
annual_income 61028 15.25700
income_verification 1 0.00025
issued_date 1 0.00025
loan_reason 276440 69.11000
loan_category 1 0.00025
loan_title 19 0.00475
zip_code 1 0.00025
state 1 0.00025
debt_payment_ratio 1 0.00025
delinquency_frequency 1 0.00025
credit_earliest_date 1 0.00025
creditor_inquiries 1 0.00025
borrower_delinquency_recency 218802 54.70050
public_record_recency 348845 87.21125
open_credit_line_count 1 0.00025
derogatory_public_record_count 1 0.00025
total_credit 1 0.00025
credit_utilization_rate 267 0.06675
total_credit_line_count 1 0.00025
initial_status 1 0.00025
  • For target variable(interest rate), it has 15% of missing values. Since target variable controls how the learning algorithm learns, it is not advised to impute it. Therefore, I will drop these 15% of missing values rows
In [8]:
loan_data = loan_data.dropna(subset=['interest_rate'])
In [9]:
na_value = loan_data.isna().sum()
na_df = pd.DataFrame(na_value)
na_df['NA_count'] = na_df
na_df['NA%'] = (na_df['NA_count'] / loan_data.shape[0])*100
na_df = na_df.iloc[:,1:3]
na_df
Out[9]:
NA_count NA%
interest_rate 0 0.000000
loan_id 1 0.000295
borrower_id 1 0.000295
loan_requested 1 0.000295
loan_funded 1 0.000295
investor_funded 1 0.000295
payment_numbers 1 0.000295
loan_grade 51867 15.300451
loan_subgrade 51867 15.300451
job_title 20257 5.975692
years_employed 14795 4.364436
home_ownership 51960 15.327886
annual_income 51752 15.266527
income_verification 1 0.000295
issued_date 1 0.000295
loan_reason 233934 69.009115
loan_category 1 0.000295
loan_title 17 0.005015
zip_code 1 0.000295
state 1 0.000295
debt_payment_ratio 1 0.000295
delinquency_frequency 1 0.000295
credit_earliest_date 1 0.000295
creditor_inquiries 1 0.000295
borrower_delinquency_recency 185457 54.708693
public_record_recency 295590 87.197262
open_credit_line_count 1 0.000295
derogatory_public_record_count 1 0.000295
total_credit 1 0.000295
credit_utilization_rate 225 0.066374
total_credit_line_count 1 0.000295
initial_status 1 0.000295
  • loan_reason has over 69% of missing value --> we can drop it.
  • borrower_delinquency_recency has over 54% of missing value --> we can drop it.
  • public_record_recency has over 87% of missing value --> we can drop it.
  • For other predictors with missing values, we will do EDA for each column individually.
In [10]:
loan_data = loan_data.drop(columns = ['loan_reason', 'borrower_delinquency_recency', 'public_record_recency'])
In [11]:
loan_data.shape
Out[11]:
(338990, 29)
In [12]:
loan_data.loan_grade.value_counts()
Out[12]:
B    86121
C    76446
D    46984
A    45525
E    21628
F     8395
G     2024
Name: loan_grade, dtype: int64
In [13]:
len(loan_data.loan_grade.value_counts())
Out[13]:
7
In [14]:
loan_data.loan_subgrade.value_counts()
Out[14]:
B3    20352
B4    19137
B2    16767
C1    16342
C2    16310
B5    15521
C3    15425
C4    14646
B1    14344
C5    13723
A5    13086
A4    11806
D1    11720
D2    10498
D3     9091
D4     8573
A3     7653
D5     7102
A2     6496
A1     6484
E1     5447
E2     5246
E3     4230
E4     3640
E5     3065
F1     2490
F2     1873
F3     1712
F4     1331
F5      989
G1      677
G2      511
G3      378
G4      252
G5      206
Name: loan_subgrade, dtype: int64
In [15]:
len(loan_data.loan_subgrade.value_counts())
Out[15]:
35
  • Since loan_id, borrower_id are not significat predictors for interest rate, I decided to drop them.
  • For loan_grade and loan_subgrade, they represent the same thing. Therefore, I would only select one of them. There are 7 categories in grade while there are 35 categories in subgrade. Choosing subgrade as a predictor and do dummy variable will be too complex and unnessary. Therefore, I selected grade and drop subgrade.
  • For zipcode and state, since both of them have overlap geographical meaning, I would only select one of them. I selected state because zipcode is too detailed and would be unnessary to analyze at this point.
In [16]:
loan_data = loan_data.drop(columns = ['loan_id', 'borrower_id', 'loan_subgrade', 'zip_code'])
In [17]:
loan_data.shape
Out[17]:
(338990, 25)
issued_date & credit_earliest_date¶
In [18]:
# issued_date
loan_data['issued_month'] = loan_data['issued_date'].apply(lambda x: str(x)[:3])
loan_data['issued_year'] = loan_data['issued_date'].apply(lambda x: str(x)[-2:])

# credit_earliest date
loan_data['credit_earliest_month'] = loan_data['credit_earliest_date'].apply(lambda x: str(x)[:3])
loan_data['credit_earliest_year'] = loan_data['credit_earliest_date'].apply(lambda x: str(x)[-2:])

loan_data = loan_data.drop(['issued_date', 'credit_earliest_date' ], axis=1)
In [19]:
loan_data['credit_earliest_month'].unique()
Out[19]:
array(['Feb', 'Oct', 'Jun', 'Jan', 'Dec', 'Apr', 'Nov', 'Jul', 'May',
       'Aug', 'Sep', 'Mar', 'nan'], dtype=object)
In [20]:
loan_data.loc[loan_data['credit_earliest_month'] == "nan"]
Out[20]:
interest_rate loan_requested loan_funded investor_funded payment_numbers loan_grade job_title years_employed home_ownership annual_income income_verification loan_category loan_title state debt_payment_ratio delinquency_frequency creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year
364111 7.69% NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN nan an nan an
  • drop this row because most of the columns in this row are na
In [21]:
loan_data = loan_data.loc[loan_data['credit_earliest_month'] != 'nan']
In [22]:
loan_data['credit_earliest_year'].unique()
Out[22]:
array(['94', '00', '85', '96', '98', '93', '01', '06', '95', '97', '90',
       '91', '03', '05', '75', '02', '88', '86', '87', '92', '89', '99',
       '04', '83', '80', '84', '71', '63', '81', '76', '82', '79', '72',
       '78', '73', '77', '74', '69', '70', '68', '67', '64', '07', '61',
       '62', '66', '65', '54', '08', '50', '46', '59', '56', '09', '58',
       '60', '10', '55', '53', '57', '11', '44', '51', '49'], dtype=object)
In [23]:
loan_data.loc[loan_data['credit_earliest_year'] == "an"]
Out[23]:
interest_rate loan_requested loan_funded investor_funded payment_numbers loan_grade job_title years_employed home_ownership annual_income income_verification loan_category loan_title state debt_payment_ratio delinquency_frequency creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year
  • drop this row because most of the columns in this row are na
In [24]:
loan_data = loan_data.loc[loan_data['credit_earliest_year'] != "an"]
In [25]:
loan_data['issued_month'].unique()
Out[25]:
array(['Aug', 'May', 'Mar', 'Nov', 'Apr', 'Sep', 'Feb', 'Dec', 'Jun',
       'Oct', 'Jul', 'Jan'], dtype=object)
In [26]:
loan_data.loc[loan_data['issued_month'] == "nan"]
Out[26]:
interest_rate loan_requested loan_funded investor_funded payment_numbers loan_grade job_title years_employed home_ownership annual_income income_verification loan_category loan_title state debt_payment_ratio delinquency_frequency creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year
  • drop this row because most of the columns in this row are na
In [27]:
loan_data = loan_data.loc[loan_data['issued_month'] != "nan"]
In [28]:
loan_data['issued_year'].unique()
Out[28]:
array(['09', '08', '14', '10', '12', '07', '11', '13'], dtype=object)
In [29]:
loan_data.loc[loan_data['issued_year'] == "an"]
Out[29]:
interest_rate loan_requested loan_funded investor_funded payment_numbers loan_grade job_title years_employed home_ownership annual_income income_verification loan_category loan_title state debt_payment_ratio delinquency_frequency creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year
  • drop this row because most of the columns in this row are na
In [30]:
loan_data = loan_data.loc[loan_data['issued_year'] != "an"]
In [31]:
loan_data['credit_earliest_year'].unique()
Out[31]:
array(['94', '00', '85', '96', '98', '93', '01', '06', '95', '97', '90',
       '91', '03', '05', '75', '02', '88', '86', '87', '92', '89', '99',
       '04', '83', '80', '84', '71', '63', '81', '76', '82', '79', '72',
       '78', '73', '77', '74', '69', '70', '68', '67', '64', '07', '61',
       '62', '66', '65', '54', '08', '50', '46', '59', '56', '09', '58',
       '60', '10', '55', '53', '57', '11', '44', '51', '49'], dtype=object)
In [32]:
loan_data['credit_earliest_month'].unique()
Out[32]:
array(['Feb', 'Oct', 'Jun', 'Jan', 'Dec', 'Apr', 'Nov', 'Jul', 'May',
       'Aug', 'Sep', 'Mar'], dtype=object)
In [33]:
loan_data['issued_year'].unique()
Out[33]:
array(['09', '08', '14', '10', '12', '07', '11', '13'], dtype=object)
In [34]:
loan_data['issued_month'].unique()
Out[34]:
array(['Aug', 'May', 'Mar', 'Nov', 'Apr', 'Sep', 'Feb', 'Dec', 'Jun',
       'Oct', 'Jul', 'Jan'], dtype=object)
In [35]:
# Define month convertion function
def mtn(x):
    months = {
        "jan": 1,
        "feb": 2,
        "mar": 3,
        "apr": 4,
        "may": 5,
        "jun": 6,
        "jul": 7,
        "aug": 8,
        "sep": 9,
        "oct": 10,
        "nov": 11,
        "dec": 12,
    }
    try: # convert all the month into lowercase to match them to numbers
        x = months[x.strip()[:3].lower()]
        return x
    except:
        pass

# Define year convertion function
# Since it's only 2023 now, I assume that last 2 digits greater than 30 is in 1900 century
# for last 2 digits less than 30, I assume that it's in 2000 century

def year(x):
    x = x.strip()[-2:]
    try:
        if float(x) > 30: 
            x = "19" + str(x)
        else:             # 
            x = "20" + str(x)
        return int(x)
    except:
        pass
In [36]:
loan_data["issued_month"].unique()
Out[36]:
array(['Aug', 'May', 'Mar', 'Nov', 'Apr', 'Sep', 'Feb', 'Dec', 'Jun',
       'Oct', 'Jul', 'Jan'], dtype=object)
In [37]:
# change issued_date & credit_earliest_date's Month & Year
loan_data['issued_month'] = loan_data['issued_month'].apply(lambda x: mtn(x))
loan_data['credit_earliest_month']  = loan_data['credit_earliest_month'] .apply(lambda x: mtn(x))
loan_data['issued_year'] = loan_data['issued_year'].apply(lambda x: year(x))
loan_data['credit_earliest_year'] = loan_data['credit_earliest_year'].apply(lambda x: year(x))
In [38]:
loan_data.head()
Out[38]:
interest_rate loan_requested loan_funded investor_funded payment_numbers loan_grade job_title years_employed home_ownership annual_income income_verification loan_category loan_title state debt_payment_ratio delinquency_frequency creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year
0 11.89% $25,000 $25,000 $19,080 36 months B NaN < 1 year RENT 85000.0 VERIFIED - income debt_consolidation Debt consolidation for on-time payer CA 19.48 0.0 0.0 10.0 0.0 28854.0 52.10% 42.0 f 8 2009 2 1994
1 10.71% $7,000 $7,000 $673 36 months B CNN < 1 year RENT 65000.0 not verified credit_card Credit Card payoff NY 14.29 0.0 0.0 7.0 0.0 33623.0 76.70% 7.0 f 5 2008 10 2000
2 16.99% $25,000 $25,000 $24,725 36 months D Web Programmer 1 year RENT 70000.0 VERIFIED - income debt_consolidation mlue NY 10.50 0.0 0.0 10.0 0.0 19878.0 66.30% 17.0 f 8 2014 6 2000
3 13.11% $1,200 $1,200 $1,200 36 months C city of beaumont texas 10+ years OWN 54000.0 not verified debt_consolidation zxcvb TX 5.47 0.0 0.0 5.0 0.0 2584.0 40.40% 31.0 f 3 2010 1 1985
4 13.57% $10,800 $10,800 $10,692 36 months C State Farm Insurance 6 years RENT 32000.0 not verified debt_consolidation Nicolechr1978 CT 11.63 0.0 1.0 14.0 0.0 3511.0 25.60% 40.0 f 11 2009 12 1996
In [39]:
countplt, ax = plt.subplots(figsize = (10,7))

ax = sns.countplot(x='issued_month', 
              data=loan_data,
             order = loan_data['issued_month'].value_counts().index)
  • From above, we can see that July is the month that has the highest amount of issued loan.
In [40]:
countplt, ax = plt.subplots(figsize = (10,7))

ax = sns.countplot(x='issued_year', 
              data=loan_data,
             order = loan_data['issued_year'].value_counts().index)
  • From above, we can see that 2014 is the year that has the highest amount of issued loan.
In [41]:
countplt, ax = plt.subplots(figsize = (10,7))

ax = sns.countplot(x='credit_earliest_month', 
              data=loan_data,
             order = loan_data['credit_earliest_month'].value_counts().index)
In [42]:
countplt, ax = plt.subplots(figsize = (10,7))

ax = sns.countplot(x='credit_earliest_year', 
              data=loan_data,
             order = loan_data['credit_earliest_year'].value_counts().index[:10])
In [43]:
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 338989 entries, 0 to 399999
Data columns (total 27 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   interest_rate                   338989 non-null  object 
 1   loan_requested                  338989 non-null  object 
 2   loan_funded                     338989 non-null  object 
 3   investor_funded                 338989 non-null  object 
 4   payment_numbers                 338989 non-null  object 
 5   loan_grade                      287123 non-null  object 
 6   job_title                       318733 non-null  object 
 7   years_employed                  324195 non-null  object 
 8   home_ownership                  287030 non-null  object 
 9   annual_income                   287238 non-null  float64
 10  income_verification             338989 non-null  object 
 11  loan_category                   338989 non-null  object 
 12  loan_title                      338973 non-null  object 
 13  state                           338989 non-null  object 
 14  debt_payment_ratio              338989 non-null  float64
 15  delinquency_frequency           338989 non-null  float64
 16  creditor_inquiries              338989 non-null  float64
 17  open_credit_line_count          338989 non-null  float64
 18  derogatory_public_record_count  338989 non-null  float64
 19  total_credit                    338989 non-null  float64
 20  credit_utilization_rate         338765 non-null  object 
 21  total_credit_line_count         338989 non-null  float64
 22  initial_status                  338989 non-null  object 
 23  issued_month                    338989 non-null  int64  
 24  issued_year                     338989 non-null  int64  
 25  credit_earliest_month           338989 non-null  int64  
 26  credit_earliest_year            338989 non-null  int64  
dtypes: float64(8), int64(4), object(15)
memory usage: 72.4+ MB
From above, we can see that data type of some variables are inproper and need to be transformed.¶
  • For loan_requested, loan_funded, investor_funded --> need to remove dollar sign and transform to numerical variables.
In [44]:
for col in ['loan_requested', 'loan_funded', 'investor_funded']:
    loan_data[col] = loan_data[col].str.slice(start=1).str.replace(",","").astype("float")
  • For interest_rate, credit_utilization_rate --> remove % and transform them to numerical variables
In [45]:
for col in ['interest_rate', 'credit_utilization_rate']:
    loan_data[col] = loan_data[col].str.slice(stop=-1).astype("float")
Plot target variable "interest rate" 's distribution¶
In [46]:
loan_data.hist(column = 'interest_rate')
Out[46]:
array([[<AxesSubplot:title={'center':'interest_rate'}>]], dtype=object)
  • Interest rate seem's normally distributed and thus linear regression model can be applied
In [47]:
loan_data['interest_rate'].describe()
Out[47]:
count    338989.000000
mean         13.946289
std           4.377945
min           5.420000
25%          10.990000
50%          13.680000
75%          16.780000
max          26.060000
Name: interest_rate, dtype: float64
In [49]:
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 338989 entries, 0 to 399999
Data columns (total 27 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   interest_rate                   338989 non-null  float64
 1   loan_requested                  338989 non-null  float64
 2   loan_funded                     338989 non-null  float64
 3   investor_funded                 338989 non-null  float64
 4   payment_numbers                 338989 non-null  object 
 5   loan_grade                      287123 non-null  object 
 6   job_title                       318733 non-null  object 
 7   years_employed                  324195 non-null  object 
 8   home_ownership                  287030 non-null  object 
 9   annual_income                   287238 non-null  float64
 10  income_verification             338989 non-null  object 
 11  loan_category                   338989 non-null  object 
 12  loan_title                      338973 non-null  object 
 13  state                           338989 non-null  object 
 14  debt_payment_ratio              338989 non-null  float64
 15  delinquency_frequency           338989 non-null  float64
 16  creditor_inquiries              338989 non-null  float64
 17  open_credit_line_count          338989 non-null  float64
 18  derogatory_public_record_count  338989 non-null  float64
 19  total_credit                    338989 non-null  float64
 20  credit_utilization_rate         338765 non-null  float64
 21  total_credit_line_count         338989 non-null  float64
 22  initial_status                  338989 non-null  object 
 23  issued_month                    338989 non-null  int64  
 24  issued_year                     338989 non-null  int64  
 25  credit_earliest_month           338989 non-null  int64  
 26  credit_earliest_year            338989 non-null  int64  
dtypes: float64(13), int64(4), object(10)
memory usage: 72.4+ MB
loan_requested¶
In [50]:
loan_data.loan_requested.describe()
Out[50]:
count    338989.000000
mean      14276.968353
std        8252.264104
min         500.000000
25%        8000.000000
50%       12000.000000
75%       20000.000000
max       35000.000000
Name: loan_requested, dtype: float64
In [51]:
sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/454341747.py:1: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
Out[51]:
<AxesSubplot:xlabel='loan_requested', ylabel='Density'>
  • For loan_requested, there is only one missing value. I will use median to fill in the missing value so that this variable won't be affected by outlier and the distribution will still be similar.
In [52]:
loan_data['loan_requested'].fillna(loan_data['loan_requested'].median(), inplace=True)
In [53]:
sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/454341747.py:1: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
Out[53]:
<AxesSubplot:xlabel='loan_requested', ylabel='Density'>
In [54]:
loan_data.loan_requested.describe()
Out[54]:
count    338989.000000
mean      14276.968353
std        8252.264104
min         500.000000
25%        8000.000000
50%       12000.000000
75%       20000.000000
max       35000.000000
Name: loan_requested, dtype: float64
  • From above table and histrgram, we can see that the overall quantile is about the same to the original values.
loan_funded¶
In [55]:
loan_data.loan_funded.describe()
Out[55]:
count    338989.000000
mean      14248.306140
std        8238.064335
min         500.000000
25%        8000.000000
50%       12000.000000
75%       20000.000000
max       35000.000000
Name: loan_funded, dtype: float64
In [56]:
sns.distplot(loan_data[loan_data['loan_funded'].notnull()]['loan_funded'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/3528064487.py:1: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(loan_data[loan_data['loan_funded'].notnull()]['loan_funded'])
Out[56]:
<AxesSubplot:xlabel='loan_funded', ylabel='Density'>
In [57]:
loan_data['loan_funded'].fillna(loan_data['loan_funded'].median(), inplace=True)
  • For loan_funded, there is only one missing value. I will use median to fill in the missing value so that this variable won't be affected by outlier and the distribution will still be similar.
In [58]:
loan_data.loan_funded.describe()
Out[58]:
count    338989.000000
mean      14248.306140
std        8238.064335
min         500.000000
25%        8000.000000
50%       12000.000000
75%       20000.000000
max       35000.000000
Name: loan_funded, dtype: float64
  • From above table and histrgram, we can see that the overall quantile is about the same to the original values.
investor_funded¶
In [59]:
loan_data.investor_funded.describe()
Out[59]:
count    338989.000000
mean      14182.000445
std        8255.138412
min           0.000000
25%        8000.000000
50%       12000.000000
75%       19900.000000
max       35000.000000
Name: investor_funded, dtype: float64
In [60]:
sns.distplot(loan_data[loan_data['investor_funded'].notnull()]['investor_funded'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/2461909109.py:1: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(loan_data[loan_data['investor_funded'].notnull()]['investor_funded'])
Out[60]:
<AxesSubplot:xlabel='investor_funded', ylabel='Density'>
In [61]:
loan_data['investor_funded'].fillna(loan_data['investor_funded'].median(), inplace=True)
  • For investor_funded, there is only one missing value. I will use median to fill in the missing value so that this variable won't be affected by outlier and the distribution will still be similar.
In [62]:
loan_data.investor_funded.describe()
Out[62]:
count    338989.000000
mean      14182.000445
std        8255.138412
min           0.000000
25%        8000.000000
50%       12000.000000
75%       19900.000000
max       35000.000000
Name: investor_funded, dtype: float64
  • From above table and histrgram, we can see that the overall quantile is about the same to the original values.
For payment_numbers --> since the majority (73%) is 36 months, I will create a new variable is_36months¶
In [63]:
loan_data.payment_numbers.value_counts()
Out[63]:
 36 months    247791
 60 months     91198
Name: payment_numbers, dtype: int64
In [64]:
247791/(247791+ 91198)
Out[64]:
0.7309706214655933
In [65]:
loan_data['is_36months'] = loan_data['payment_numbers'].apply(lambda x: 1 if x == '36 months' else 0)
In [66]:
loan_data = loan_data.drop('payment_numbers', axis=1)
loan_grade¶
In [67]:
loan_data.loan_grade.value_counts()
Out[67]:
B    86121
C    76446
D    46984
A    45525
E    21628
F     8395
G     2024
Name: loan_grade, dtype: int64
In [68]:
sns.countplot(x='loan_grade', data=loan_data)
Out[68]:
<AxesSubplot:xlabel='loan_grade', ylabel='count'>
  • Since there are around 15% of missing values, it's better to impute those missing values with another category rather than imputing them with the most common category
In [69]:
loan_data.loan_grade = loan_data.loan_grade.fillna('Other')
In [70]:
loan_data.loan_grade.value_counts()
Out[70]:
B        86121
C        76446
Other    51866
D        46984
A        45525
E        21628
F         8395
G         2024
Name: loan_grade, dtype: int64
job_title¶
In [72]:
len(loan_data.job_title.value_counts())
Out[72]:
163395
In [73]:
loan_data.job_title = loan_data.job_title.fillna('Other')
In [74]:
loan_data = loan_data.drop('job_title', axis=1)
  • there are too many job titles, if would be too complicated to include all of them in the model. Will consider to use clustering techniques such as K-means, DCSCAN, hirarchical clustering to do it if the model performance is not good enough
years_employed¶
In [75]:
loan_data.years_employed.value_counts()
Out[75]:
10+ years    108491
2 years       30117
3 years       26670
< 1 year      26003
5 years       23072
1 year        21432
4 years       20259
6 years       19601
7 years       19445
8 years       16212
9 years       12893
Name: years_employed, dtype: int64
  • change data type to float, replace unnecessary strings
  • replace na using interpolate()
In [76]:
loan_data['years_employed'] = loan_data['years_employed'].str.replace("< 1", "0").replace("+", "")
In [77]:
loan_data['years_employed'] = loan_data['years_employed'].str.strip()
In [78]:
loan_data['years_employed'] = loan_data['years_employed'].str.slice(stop=2)
In [79]:
loan_data.years_employed.value_counts()
Out[79]:
10    108491
2      30117
3      26670
0      26003
5      23072
1      21432
4      20259
6      19601
7      19445
8      16212
9      12893
Name: years_employed, dtype: int64
In [80]:
# loan_data['years_employed'] = loan_data['years_employed'].apply(lambda x: x.replace("< 1", "0").replace("+", "").strip()[:2])
In [81]:
loan_data.years_employed.unique()
Out[81]:
array(['0 ', '1 ', '10', '6 ', '9 ', '3 ', '5 ', '2 ', '4 ', '8 ', '7 ',
       nan], dtype=object)
In [82]:
sns.histplot(loan_data[loan_data["years_employed"].notnull()]["years_employed"])
plt.show()
loan_data["years_employed"].describe()
Out[82]:
count     324195
unique        11
top           10
freq      108491
Name: years_employed, dtype: object
In [83]:
loan_data['years_employed'] = loan_data['years_employed'].astype('float')
In [84]:
sns.histplot(loan_data[loan_data["years_employed"].notnull()]["years_employed"])
plt.show()
loan_data["years_employed"].describe()
Out[84]:
count    324195.000000
mean          5.991567
std           3.613908
min           0.000000
25%           3.000000
50%           6.000000
75%          10.000000
max          10.000000
Name: years_employed, dtype: float64
  • There are 4% of missing values for years_employed --> interpolate() to fill in the missing values
In [85]:
loan_data['years_employed'] = loan_data['years_employed'].interpolate()
In [86]:
sns.histplot(loan_data[loan_data["years_employed"].notnull()]["years_employed"])
plt.show()
loan_data["years_employed"].describe()
Out[86]:
count    338989.000000
mean          5.996765
std           3.574865
min           0.000000
25%           3.000000
50%           6.000000
75%          10.000000
max          10.000000
Name: years_employed, dtype: float64
  • Check the bar plot and see data's distribution and quantile are similar to the original one
  • Create a new variable that regroup years_employed into 3 groups
    1. for years employed 0 - 5 --> G1
    2. for years employed 6 - 10 --> G2
    3. for years employed 10 + --> G3
In [87]:
year_condition = [(loan_data['years_employed'] >= 0) & (loan_data['years_employed'] < 6), 
                  (loan_data['years_employed'] >= 6) & (loan_data['years_employed'] < 10),
                 loan_data['years_employed'] >= 10]
year_em_categories = ['G1', 'G2', 'G3']
loan_data['YE_categories'] = np.select(year_condition, year_em_categories)
In [88]:
sns.countplot(x='YE_categories', data=loan_data)
Out[88]:
<AxesSubplot:xlabel='YE_categories', ylabel='count'>
In [89]:
loan_data.YE_categories.value_counts()
Out[89]:
G1    154032
G3    110301
G2     74656
Name: YE_categories, dtype: int64
  • From above, we can see that G1(employed_years from 0 to 5 years) is the majority among the group.
In [90]:
loan_data["YE_categories"] = loan_data["YE_categories"].astype("category")
In [91]:
loan_data = loan_data.drop('years_employed', axis=1)
home_ownership¶
In [92]:
loan_data.home_ownership.value_counts()
Out[92]:
MORTGAGE    145958
RENT        115958
OWN          24976
OTHER          107
NONE            30
ANY              1
Name: home_ownership, dtype: int64
In [93]:
sns.countplot(x='home_ownership', data=loan_data)
Out[93]:
<AxesSubplot:xlabel='home_ownership', ylabel='count'>
  • From metadata, we know that there should be only 4 status for home_ownership, therefore I combine none and any into other. I also replace na with other.
In [94]:
loan_data["home_ownership"] = loan_data["home_ownership"].str.replace("NONE", "OTHER").str.replace("ANY", "OTHER")
In [95]:
loan_data["home_ownership"]  = loan_data["home_ownership"] .fillna("OTHER")
In [96]:
loan_data["home_ownership"].value_counts()
Out[96]:
MORTGAGE    145958
RENT        115958
OTHER        52097
OWN          24976
Name: home_ownership, dtype: int64
In [97]:
sns.countplot(x='home_ownership', data=loan_data)
Out[97]:
<AxesSubplot:xlabel='home_ownership', ylabel='count'>
  • We can see from above plot that mortgage is the majority.
annual income¶
In [98]:
loan_data.annual_income.describe()
Out[98]:
count    2.872380e+05
mean     7.315146e+04
std      5.618967e+04
min      3.000000e+03
25%      4.500000e+04
50%      6.300000e+04
75%      8.807875e+04
max      7.500000e+06
Name: annual_income, dtype: float64
In [99]:
sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/3601847436.py:1: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
Out[99]:
<AxesSubplot:xlabel='annual_income', ylabel='Density'>
In [100]:
len(loan_data.annual_income.unique())
Out[100]:
22155
In [101]:
max(loan_data.annual_income)
Out[101]:
7500000.0
In [102]:
min(loan_data.annual_income)
Out[102]:
3000.0
In [103]:
loan_data.annual_income.mean()
Out[103]:
73151.46409611532
In [104]:
loan_data.annual_income.median()
Out[104]:
63000.0
  • There are 15% of missing values --> use interpolate()
In [105]:
loan_data.annual_income = loan_data.annual_income.interpolate()
In [106]:
loan_data.annual_income.describe()
Out[106]:
count    3.389890e+05
mean     7.311117e+04
std      5.369898e+04
min      3.000000e+03
25%      4.600000e+04
50%      6.400000e+04
75%      8.750000e+04
max      7.500000e+06
Name: annual_income, dtype: float64
In [107]:
sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/3601847436.py:1: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
Out[107]:
<AxesSubplot:xlabel='annual_income', ylabel='Density'>
  • From above plot and table, we can see that the distribution is silimar after filling the missing value.
create different bins for annual income¶
In [108]:
income_condition = [(loan_data['annual_income'] >= 0) & (loan_data['annual_income'] < 100000), 
                  (loan_data['annual_income'] >= 100000) & (loan_data['annual_income'] < 200000),
                  (loan_data['annual_income'] >= 200000) & (loan_data['annual_income'] < 300000),
                  (loan_data['annual_income'] >= 300000) & (loan_data['annual_income'] < 400000),
                  (loan_data['annual_income'] >= 400000) & (loan_data['annual_income'] < 500000),
                  (loan_data['annual_income'] >= 500000) & (loan_data['annual_income'] < 600000),
                  (loan_data['annual_income'] >= 600000) & (loan_data['annual_income'] < 700000),
                 loan_data['annual_income'] >= 700000]
income_categories = ['G1', 'G2', 'G3', 'G4', 'G5', 'G6', 'G7', 'G8']
loan_data['income_categories'] = np.select(income_condition, income_categories)
In [109]:
loan_data['income_categories'].value_counts()
Out[109]:
G1    278962
G2     54035
G3      4409
G4       933
G5       306
G6       145
G8       141
G7        58
Name: income_categories, dtype: int64
In [110]:
sns.countplot(x='income_categories', data=loan_data)
Out[110]:
<AxesSubplot:xlabel='income_categories', ylabel='count'>
  • From above, we can see that most people has annual income less than 100000. Therefore, I decide to create a variable income_over_100000
In [111]:
loan_data.groupby('income_categories')[['income_categories', 'interest_rate']].mean()
Out[111]:
interest_rate
income_categories
G1 14.020590
G2 13.591988
G3 13.611095
G4 13.770472
G5 14.598562
G6 13.741517
G7 13.613621
G8 13.299362
In [112]:
loan_data['income_over_100000'] = loan_data['income_categories'].apply(lambda x:1 if x =='G1' else 0)
In [113]:
loan_data['income_over_100000'].value_counts()
Out[113]:
1    278962
0     60027
Name: income_over_100000, dtype: int64
In [114]:
# drop income categories
loan_data = loan_data.drop(columns = ['income_categories'])
Income verification¶
In [115]:
loan_data['income_verification'].value_counts()
Out[115]:
VERIFIED - income           127040
not verified                107873
VERIFIED - income source    104076
Name: income_verification, dtype: int64
  • Assume that income source verification means that entity providing the income and are verified and income verified means that given proof of income and are verified. Both of these 2 catogories mean that the income is verified, therefore, I will combine these 2 categoties into one --> income verified.
  • I will fill in the missing value with income verified since it's the majority and there is only 1 missing value.
In [116]:
loan_data["income_verification"] = loan_data["income_verification"].str.replace("VERIFIED - income", "verified").str.replace("verified source","verified")
loan_data["income_verification"]  = loan_data["income_verification"] .fillna("verified")
loan_data['income_verification'].value_counts()
Out[116]:
verified        231116
not verified    107873
Name: income_verification, dtype: int64
In [117]:
sns.countplot(x='income_verification', data=loan_data)
Out[117]:
<AxesSubplot:xlabel='income_verification', ylabel='count'>
loan category¶
In [118]:
loan_data['loan_category'].value_counts()
Out[118]:
debt_consolidation    198226
credit_card            75680
home_improvement       19625
other                  17154
major_purchase          7312
small_business          5359
car                     4115
medical                 3329
moving                  2138
wedding                 1934
vacation                1848
house                   1723
educational              279
renewable_energy         267
Name: loan_category, dtype: int64
In [119]:
countplt, ax = plt.subplots(figsize = (10,7))

ax = sns.countplot(x='loan_category', 
              data=loan_data,
             order = loan_data['loan_category'].value_counts().index[:5])
  • Top 5 loan category includes: debt_consolidation, credit_card, home_improvement, other, and major_purchase.
In [120]:
loan_data.groupby('loan_category')[['loan_category', 'interest_rate']].mean()
Out[120]:
interest_rate
loan_category
car 12.044260
credit_card 12.740792
debt_consolidation 14.228413
educational 11.476344
home_improvement 13.623039
house 15.048247
major_purchase 12.826208
medical 15.444302
moving 16.084233
other 15.889428
renewable_energy 15.028652
small_business 15.892461
vacation 15.280622
wedding 14.144917
  • each loan_category's mean interest rate have similar values.
loan_title¶
In [121]:
loan_data['loan_title'] = loan_data['loan_title'].astype('str').apply(lambda x: x.lower())
In [325]:
# loan_data['loan_title'].value_counts()
In [123]:
len(loan_data['loan_title'].value_counts())
Out[123]:
45276
In [124]:
# Debt consolidation  
123592/338973
Out[124]:
0.3646072106037944
In [125]:
# Credit card refinancing 
40820/338973
Out[125]:
0.12042257052921619

There are 45276 different loan title for this variable, I will drop this variable. Will consider using clustering techniques such as K-means, DCSCAN, hirarchical clustering to do it if the model performance is not good enough

In [126]:
loan_data = loan_data.drop(columns = ['loan_title'])
state¶
In [128]:
loan_data['state'].value_counts()
Out[128]:
CA    52835
NY    29226
TX    26493
FL    22756
IL    13483
NJ    13188
PA    11877
OH    11040
GA    10851
VA    10338
NC     9303
MI     8350
MA     8038
MD     8016
AZ     7747
WA     7706
CO     7119
MN     5865
MO     5398
CT     5243
NV     4753
IN     4608
OR     4411
WI     4245
TN     4215
AL     4188
LA     4017
SC     3979
KY     3173
KS     3096
OK     3014
AR     2532
UT     2531
NM     1848
HI     1798
WV     1738
NH     1648
RI     1486
DC     1080
MT      995
AK      947
DE      896
WY      853
SD      730
MS      707
VT      604
ID        8
IA        7
NE        6
ME        4
Name: state, dtype: int64
In [129]:
len(loan_data['state'].value_counts())
Out[129]:
50
In [131]:
states = {
    "AK": "Other",
    "AL": "South",
    "AR": "South",
    "AS": "Other",
    "AZ": "West",
    "CA": "West",
    "CO": "West",
    "CT": "Northeast",
    "DC": "Northeast",
    "DE": "Northeast",
    "FL": "South",
    "GA": "South",
    "GU": "Other",
    "HI": "Other",
    "IA": "Midwest",
    "ID": "West",
    "IL": "Midwest",
    "IN": "Midwest",
    "KS": "Midwest",
    "KY": "South",
    "LA": "South",
    "MA": "Northeast",
    "MD": "Northeast",
    "ME": "Northeast",
    "MI": "West",
    "MN": "Midwest",
    "MO": "Midwest",
    "MP": "Other",
    "MS": "South",
    "MT": "West",
    "NA": "Other",
    "NC": "South",
    "ND": "Midwest",
    "NE": "West",
    "NH": "Northeast",
    "NJ": "Northeast",
    "NM": "West",
    "NV": "West",
    "NY": "Northeast",
    "OH": "Midwest",
    "OK": "South",
    "OR": "West",
    "PA": "Northeast",
    "PR": "Other",
    "RI": "Northeast",
    "SC": "South",
    "SD": "Midwest",
    "TN": "South",
    "TX": "South",
    "UT": "West",
    "VA": "South",
    "VI": "Other",
    "VT": "Northeast",
    "WA": "West",
    "WI": "Midwest",
    "WV": "South",
    "WY": "West",
}


loan_data["region"] = loan_data["state"].map(states)

loan_data = loan_data.drop("state", axis=1)
In [132]:
loan_data.head()
Out[132]:
interest_rate loan_requested loan_funded investor_funded loan_grade home_ownership annual_income income_verification loan_category debt_payment_ratio delinquency_frequency creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year is_36months YE_categories income_over_100000 region
0 11.89 25000.0 25000.0 19080.0 B RENT 85000.0 verified debt_consolidation 19.48 0.0 0.0 10.0 0.0 28854.0 52.1 42.0 f 8 2009 2 1994 0 G1 1 West
1 10.71 7000.0 7000.0 673.0 B RENT 65000.0 not verified credit_card 14.29 0.0 0.0 7.0 0.0 33623.0 76.7 7.0 f 5 2008 10 2000 0 G1 1 Northeast
2 16.99 25000.0 25000.0 24725.0 D RENT 70000.0 verified debt_consolidation 10.50 0.0 0.0 10.0 0.0 19878.0 66.3 17.0 f 8 2014 6 2000 0 G1 1 Northeast
3 13.11 1200.0 1200.0 1200.0 C OWN 54000.0 not verified debt_consolidation 5.47 0.0 0.0 5.0 0.0 2584.0 40.4 31.0 f 3 2010 1 1985 0 G3 1 South
4 13.57 10800.0 10800.0 10692.0 C RENT 32000.0 not verified debt_consolidation 11.63 0.0 1.0 14.0 0.0 3511.0 25.6 40.0 f 11 2009 12 1996 0 G2 1 Northeast
In [133]:
loan_data["region"].value_counts()
Out[133]:
South        107304
West          99162
Northeast     81306
Midwest       48472
Other          2745
Name: region, dtype: int64
  • Southern states stands majority among all other states.

For continuous variables, plot histogram to see its distribution¶

  • if it's not normal distribution --> do transformation to make it distributed normally.
debt_payment_ratio¶
In [134]:
loan_data.hist(column = 'debt_payment_ratio')
Out[134]:
array([[<AxesSubplot:title={'center':'debt_payment_ratio'}>]],
      dtype=object)
In [135]:
loan_data['debt_payment_ratio'].describe()
Out[135]:
count    338989.000000
mean         16.997242
std           7.697223
min           0.000000
25%          11.250000
50%          16.700000
75%          22.500000
max          39.990000
Name: debt_payment_ratio, dtype: float64
  • From above histogram we can see that debt_payment_ratio follows normal distribution with right tail slightly skewed. (consider to remove values > 36 later if model performance is not good)
In [136]:
len(loan_data[loan_data['debt_payment_ratio']>36])
Out[136]:
845
delinquency_frequency¶
In [137]:
loan_data.hist(column = 'delinquency_frequency')
Out[137]:
array([[<AxesSubplot:title={'center':'delinquency_frequency'}>]],
      dtype=object)
In [138]:
loan_data['delinquency_frequency'].describe()
Out[138]:
count    338989.000000
mean          0.274286
std           0.779294
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          29.000000
Name: delinquency_frequency, dtype: float64
In [139]:
len(loan_data[loan_data['delinquency_frequency'] == 0])
Out[139]:
280249
In [140]:
280249/338989
Out[140]:
0.8267200410632793
  • Over 82% data in delinquency_frequency = 0, meaning that over 82% of people didn't have 30+ incidences of delinquency in the borrower's credit file for the past 2 years. Therefore, I will create a categorical variable to determine whether a person has delinquency(1) or not(0).
In [141]:
loan_data['has_delinquency'] = loan_data['delinquency_frequency'].apply(lambda x: 1 if x != 0 else 0)
In [142]:
loan_data['has_delinquency'].value_counts()
Out[142]:
0    280249
1     58740
Name: has_delinquency, dtype: int64
In [143]:
sns.countplot(x='has_delinquency', data=loan_data)
Out[143]:
<AxesSubplot:xlabel='has_delinquency', ylabel='count'>
In [144]:
loan_data = loan_data.drop(columns = ['delinquency_frequency'])
In [145]:
# loan_data.head(5)
Creditor_inquiries¶
In [146]:
loan_data.hist(column = 'creditor_inquiries')
Out[146]:
array([[<AxesSubplot:title={'center':'creditor_inquiries'}>]],
      dtype=object)
In [147]:
loan_data['creditor_inquiries'].describe()
Out[147]:
count    338989.000000
mean          0.817911
std           1.059264
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max           8.000000
Name: creditor_inquiries, dtype: float64
In [148]:
len(loan_data[loan_data['creditor_inquiries'] == 0])
Out[148]:
172066
In [149]:
172066/338989
Out[149]:
0.5075857918693527
In [150]:
len(loan_data[(loan_data['creditor_inquiries'] >= 0) & (loan_data['creditor_inquiries'] < 3)])  
Out[150]:
311443
In [151]:
311443/338989
Out[151]:
0.9187407261002569
  • Over 50% data in creditor_inquiries = 0, meaning that over 50% of creditors didn't have inquiries during the past 6 months . Therefore, I will create a categorical variable to determine whether a creditor has inquiries(1) or not(0) in the past 6 months.
In [152]:
loan_data['has_inquiry'] = loan_data['creditor_inquiries'].apply(lambda x: 1 if x != 0 else 0)
In [153]:
loan_data['has_inquiry'].value_counts()
Out[153]:
0    172066
1    166923
Name: has_inquiry, dtype: int64
In [154]:
sns.countplot(x='has_inquiry', data=loan_data)
Out[154]:
<AxesSubplot:xlabel='has_inquiry', ylabel='count'>
In [155]:
loan_data = loan_data.drop(columns = ['creditor_inquiries'])
open_credit_line_count¶
In [156]:
loan_data.hist(column = 'open_credit_line_count')
Out[156]:
array([[<AxesSubplot:title={'center':'open_credit_line_count'}>]],
      dtype=object)
In [157]:
loan_data['open_credit_line_count'].describe()
Out[157]:
count    338989.000000
mean         11.118081
std           4.896081
min           0.000000
25%           8.000000
50%          10.000000
75%          14.000000
max          76.000000
Name: open_credit_line_count, dtype: float64
In [158]:
loan_data['log_open_credit_line'] = np.log10(loan_data['open_credit_line_count'])
loan_data.head(5)
/Users/maggiechuang/opt/anaconda3/lib/python3.9/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[158]:
interest_rate loan_requested loan_funded investor_funded loan_grade home_ownership annual_income income_verification loan_category debt_payment_ratio open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year is_36months YE_categories income_over_100000 region has_delinquency has_inquiry log_open_credit_line
0 11.89 25000.0 25000.0 19080.0 B RENT 85000.0 verified debt_consolidation 19.48 10.0 0.0 28854.0 52.1 42.0 f 8 2009 2 1994 0 G1 1 West 0 0 1.000000
1 10.71 7000.0 7000.0 673.0 B RENT 65000.0 not verified credit_card 14.29 7.0 0.0 33623.0 76.7 7.0 f 5 2008 10 2000 0 G1 1 Northeast 0 0 0.845098
2 16.99 25000.0 25000.0 24725.0 D RENT 70000.0 verified debt_consolidation 10.50 10.0 0.0 19878.0 66.3 17.0 f 8 2014 6 2000 0 G1 1 Northeast 0 0 1.000000
3 13.11 1200.0 1200.0 1200.0 C OWN 54000.0 not verified debt_consolidation 5.47 5.0 0.0 2584.0 40.4 31.0 f 3 2010 1 1985 0 G3 1 South 0 0 0.698970
4 13.57 10800.0 10800.0 10692.0 C RENT 32000.0 not verified debt_consolidation 11.63 14.0 0.0 3511.0 25.6 40.0 f 11 2009 12 1996 0 G2 1 Northeast 0 1 1.146128
In [159]:
# loan_data['log_open_credit_line'].unique()
In [160]:
# len(loan_data[loan_data['log_open_credit_line'] <0])
In [161]:
loan_data.loc[loan_data['log_open_credit_line'] <0]
Out[161]:
interest_rate loan_requested loan_funded investor_funded loan_grade home_ownership annual_income income_verification loan_category debt_payment_ratio open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year credit_earliest_month credit_earliest_year is_36months YE_categories income_over_100000 region has_delinquency has_inquiry log_open_credit_line
101071 16.29 3000.0 3000.0 3000.0 C RENT 50000.000000 not verified car 0.00 0.0 0.0 0.0 NaN 5.0 f 1 2013 12 2003 0 G1 1 South 0 0 -inf
123995 22.47 3900.0 3900.0 3900.0 Other RENT 13000.000000 verified debt_consolidation 6.44 0.0 0.0 0.0 NaN 10.0 f 4 2013 4 2004 0 G1 1 Midwest 0 0 -inf
148869 18.25 16000.0 16000.0 16000.0 Other RENT 55666.666667 verified car 0.00 0.0 0.0 0.0 NaN 19.0 w 8 2013 1 1995 0 G2 1 Northeast 0 0 -inf
235849 14.98 8000.0 8000.0 8000.0 C RENT 120000.000000 verified major_purchase 0.00 0.0 0.0 0.0 NaN 6.0 f 1 2014 7 2008 0 G1 0 West 0 0 -inf
268243 21.18 3550.0 3550.0 3550.0 Other OWN 43680.000000 verified debt_consolidation 0.00 0.0 0.0 0.0 NaN 3.0 w 3 2014 12 1980 0 G1 1 South 0 0 -inf
269378 18.92 9400.0 9400.0 9400.0 Other OWN 93000.000000 verified credit_card 0.52 0.0 0.0 0.0 NaN 6.0 f 3 2014 5 1997 0 G1 1 Northeast 1 1 -inf
In [162]:
# loan_data['has_inquiry'] = loan_data['creditor_inquiries'].apply(lambda x: 1 if x != 0 else 0)
In [163]:
loan_data['log_open_credit_line'].replace([-np.inf], 0, inplace = True)
In [164]:
loan_data['log_open_credit_line'].unique()
Out[164]:
array([1.        , 0.84509804, 0.69897   , 1.14612804, 0.77815125,
       0.47712125, 1.11394335, 1.04139269, 1.30103   , 0.30103   ,
       1.20411998, 0.60205999, 1.07918125, 1.23044892, 0.90308999,
       1.17609126, 1.32221929, 0.95424251, 1.2787536 , 1.36172784,
       1.43136376, 1.38021124, 1.34242268, 1.47712125, 1.62324929,
       1.49136169, 1.50514998, 1.25527251, 1.39794001, 1.61278386,
       1.462398  , 1.44715803, 1.41497335, 1.54406804, 1.64345268,
       1.53147892, 1.5563025 , 1.5797836 , 1.63346846, 1.51851394,
       1.56820172, 1.69019608, 1.65321251, 1.59106461, 1.66275783,
       0.        , 1.69897   , 1.60205999, 1.68124124, 1.79239169,
       1.76342799, 1.70757018, 1.78532984, 1.88081359, 1.67209786,
       1.73239376, 1.74036269, 1.72427587])
In [165]:
loan_data.hist(column = 'log_open_credit_line')
Out[165]:
array([[<AxesSubplot:title={'center':'log_open_credit_line'}>]],
      dtype=object)
  • By doing log transformation, the left skewed distribution has become more normalized.
In [166]:
loan_data = loan_data.drop(['open_credit_line_count'], axis=1)
derogatory_public_record_count¶
In [167]:
loan_data.hist(column = 'derogatory_public_record_count')
Out[167]:
array([[<AxesSubplot:title={'center':'derogatory_public_record_count'}>]],
      dtype=object)
In [168]:
loan_data['derogatory_public_record_count'].describe()
Out[168]:
count    338989.000000
mean          0.152689
std           0.505363
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max          63.000000
Name: derogatory_public_record_count, dtype: float64
In [169]:
len(loan_data[loan_data['derogatory_public_record_count'] == 0])
Out[169]:
296229
In [170]:
296229/338989
Out[170]:
0.8738602137532486
  • Over 87% derogatory_public_record_count = 0. Therefore, I will create a categorical variable to determine whether there is derogatory public record(1) or not(0).
In [171]:
loan_data['derogatory_public_record'] = loan_data['derogatory_public_record_count'].apply(lambda x: 1 if x != 0 else 0)
In [172]:
sns.countplot(x='derogatory_public_record', data=loan_data)
Out[172]:
<AxesSubplot:xlabel='derogatory_public_record', ylabel='count'>
total_credit¶
In [173]:
loan_data.hist(column = 'total_credit')
Out[173]:
array([[<AxesSubplot:title={'center':'total_credit'}>]], dtype=object)
In [174]:
loan_data['total_credit'].describe()
Out[174]:
count    3.389890e+05
mean     1.597434e+04
std      1.976517e+04
min      0.000000e+00
25%      6.462000e+03
50%      1.178200e+04
75%      2.023300e+04
max      2.568995e+06
Name: total_credit, dtype: float64
In [175]:
# 40.889k
fig = px.box(loan_data, y="total_credit")
fig.show()
In [176]:
len(loan_data[loan_data['total_credit'] > 40889])
Out[176]:
15778
In [177]:
15778/338989
Out[177]:
0.04654428314783075
In [178]:
loan_data = loan_data[loan_data['total_credit'] <= 40889]
In [179]:
loan_data.hist(column = 'total_credit')
Out[179]:
array([[<AxesSubplot:title={'center':'total_credit'}>]], dtype=object)
  • After removing outliers, total_credit distribution become more normally distributed.
credit_utilization_rate¶
In [180]:
loan_data.hist(column = 'credit_utilization_rate')
Out[180]:
array([[<AxesSubplot:title={'center':'credit_utilization_rate'}>]],
      dtype=object)
In [181]:
loan_data['credit_utilization_rate'].describe()
Out[181]:
count    323009.000000
mean         55.811409
std          23.726387
min           0.000000
25%          38.900000
50%          57.200000
75%          74.300000
max         892.300000
Name: credit_utilization_rate, dtype: float64
In [182]:
loan_data['credit_utilization_rate'] = loan_data['credit_utilization_rate'].interpolate()
In [183]:
loan_data['credit_utilization_rate'].describe()
Out[183]:
count    323211.000000
mean         55.810266
std          23.722860
min           0.000000
25%          38.900000
50%          57.200000
75%          74.300000
max         892.300000
Name: credit_utilization_rate, dtype: float64
In [184]:
loan_data.hist(column = 'credit_utilization_rate')
Out[184]:
array([[<AxesSubplot:title={'center':'credit_utilization_rate'}>]],
      dtype=object)
In [185]:
fig = px.box(loan_data, y="credit_utilization_rate")
fig.show()
In [186]:
len(loan_data[loan_data['credit_utilization_rate'] > 127.4])
Out[186]:
12
In [187]:
loan_data = loan_data[loan_data['credit_utilization_rate'] <= 127.4]
In [188]:
loan_data.hist(column = 'credit_utilization_rate')
Out[188]:
array([[<AxesSubplot:title={'center':'credit_utilization_rate'}>]],
      dtype=object)
  • After removing outliers, credit_utilization_rate's distribution become more normally distributed.
In [189]:
loan_data['credit_utilization_rate'].describe()
Out[189]:
count    323199.000000
mean         55.804006
std          23.666161
min           0.000000
25%          38.900000
50%          57.200000
75%          74.300000
max         127.400000
Name: credit_utilization_rate, dtype: float64
total_credit_line_count¶
In [190]:
loan_data.hist(column = 'total_credit_line_count')
Out[190]:
array([[<AxesSubplot:title={'center':'total_credit_line_count'}>]],
      dtype=object)
In [191]:
loan_data['total_credit_line_count'].describe()
Out[191]:
count    323199.000000
mean         24.634126
std          11.356151
min           2.000000
25%          16.000000
50%          23.000000
75%          31.000000
max         118.000000
Name: total_credit_line_count, dtype: float64
In [192]:
fig = px.box(loan_data, y="total_credit_line_count")
fig.show()
In [193]:
len(loan_data[loan_data['total_credit_line_count'] > 53])
Out[193]:
6022
In [194]:
6022/323199
Out[194]:
0.018632483392584755
In [195]:
loan_data = loan_data[loan_data['total_credit_line_count'] <= 53]
In [196]:
loan_data['total_credit_line_count'].describe()
Out[196]:
count    317177.000000
mean         23.968957
std          10.348813
min           2.000000
25%          16.000000
50%          23.000000
75%          31.000000
max          53.000000
Name: total_credit_line_count, dtype: float64
In [197]:
loan_data.hist(column = 'total_credit_line_count')
Out[197]:
array([[<AxesSubplot:title={'center':'total_credit_line_count'}>]],
      dtype=object)
  • After removing outliers, total_credit_line_count's distribution become more normally distributed.
initial_status¶
In [198]:
loan_data['initial_status'].value_counts()
Out[198]:
f    218274
w     98903
Name: initial_status, dtype: int64
In [199]:
sns.countplot(x='initial_status', data=loan_data)
Out[199]:
<AxesSubplot:xlabel='initial_status', ylabel='count'>
In [200]:
loan_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 317177 entries, 0 to 399999
Data columns (total 27 columns):
 #   Column                          Non-Null Count   Dtype   
---  ------                          --------------   -----   
 0   interest_rate                   317177 non-null  float64 
 1   loan_requested                  317177 non-null  float64 
 2   loan_funded                     317177 non-null  float64 
 3   investor_funded                 317177 non-null  float64 
 4   loan_grade                      317177 non-null  object  
 5   home_ownership                  317177 non-null  object  
 6   annual_income                   317177 non-null  float64 
 7   income_verification             317177 non-null  object  
 8   loan_category                   317177 non-null  object  
 9   debt_payment_ratio              317177 non-null  float64 
 10  derogatory_public_record_count  317177 non-null  float64 
 11  total_credit                    317177 non-null  float64 
 12  credit_utilization_rate         317177 non-null  float64 
 13  total_credit_line_count         317177 non-null  float64 
 14  initial_status                  317177 non-null  object  
 15  issued_month                    317177 non-null  int64   
 16  issued_year                     317177 non-null  int64   
 17  credit_earliest_month           317177 non-null  int64   
 18  credit_earliest_year            317177 non-null  int64   
 19  is_36months                     317177 non-null  int64   
 20  YE_categories                   317177 non-null  category
 21  income_over_100000              317177 non-null  int64   
 22  region                          317177 non-null  object  
 23  has_delinquency                 317177 non-null  int64   
 24  has_inquiry                     317177 non-null  int64   
 25  log_open_credit_line            317177 non-null  float64 
 26  derogatory_public_record        317177 non-null  int64   
dtypes: category(1), float64(11), int64(9), object(6)
memory usage: 65.6+ MB
In [201]:
#loan_data_df = loan_data.drop(columns = ['job_title', 'years_employed', 'annual_income'])
For continuous variables --> plot correlational plot to see if different variables have high correlation to prevent multicollinearity¶
In [202]:
continuous_variable = ['loan_requested', 
                       'loan_funded', 
                       'investor_funded', 
                       'debt_payment_ratio',
                       'derogatory_public_record_count', 
                       'total_credit', 
                       'credit_utilization_rate', 
                       'total_credit_line_count', 
                       'log_open_credit_line']
In [203]:
con_df = loan_data[continuous_variable]
In [204]:
#plotting the heatmap for correlation
ax = sns.heatmap(con_df.corr(), annot=True)
  • From above plot, we can see that loan_request, loan_funded, and investor_funded have high correlation. Therefore, we should only select one of them in the model. Here I will select loan_requested.
In [205]:
loan_data_df = loan_data.drop(columns = ['loan_funded', 'investor_funded'])

Turn categorical variables into dummy variables¶

In [207]:
loan_data_df = pd.get_dummies(loan_data_df, columns = ['YE_categories', 
                                                       'home_ownership',
                                                       'loan_grade', 
                                                       'income_verification', 
                                                       'initial_status', 
                                                       'loan_category', 
                                                       #'issued_month', 
                                                       #'issued_year', 
                                                       #'credit_earliest_month', 
                                                       #'credit_earliest_year', 
                                                       'is_36months', 
                                                       'income_over_100000', 
                                                       'region', 
                                                       'has_delinquency', 
                                                       'has_inquiry',
                                                      'derogatory_public_record'])
In [208]:
loan_data_df.head()
Out[208]:
interest_rate loan_requested annual_income debt_payment_ratio derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count issued_month issued_year credit_earliest_month credit_earliest_year log_open_credit_line YE_categories_G1 YE_categories_G2 YE_categories_G3 home_ownership_MORTGAGE home_ownership_OTHER home_ownership_OWN home_ownership_RENT loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G loan_grade_Other income_verification_not verified income_verification_verified initial_status_f initial_status_w loan_category_car loan_category_credit_card loan_category_debt_consolidation loan_category_educational loan_category_home_improvement loan_category_house loan_category_major_purchase loan_category_medical loan_category_moving loan_category_other loan_category_renewable_energy loan_category_small_business loan_category_vacation loan_category_wedding is_36months_0 income_over_100000_0 income_over_100000_1 region_Midwest region_Northeast region_Other region_South region_West has_delinquency_0 has_delinquency_1 has_inquiry_0 has_inquiry_1 derogatory_public_record_0 derogatory_public_record_1
0 11.89 25000.0 85000.0 19.48 0.0 28854.0 52.1 42.0 8 2009 2 1994 1.000000 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0
1 10.71 7000.0 65000.0 14.29 0.0 33623.0 76.7 7.0 5 2008 10 2000 0.845098 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0
2 16.99 25000.0 70000.0 10.50 0.0 19878.0 66.3 17.0 8 2014 6 2000 1.000000 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0
3 13.11 1200.0 54000.0 5.47 0.0 2584.0 40.4 31.0 3 2010 1 1985 0.698970 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0
4 13.57 10800.0 32000.0 11.63 0.0 3511.0 25.6 40.0 11 2009 12 1996 1.146128 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0
In [209]:
loan_data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 317177 entries, 0 to 399999
Data columns (total 60 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   interest_rate                     317177 non-null  float64
 1   loan_requested                    317177 non-null  float64
 2   annual_income                     317177 non-null  float64
 3   debt_payment_ratio                317177 non-null  float64
 4   derogatory_public_record_count    317177 non-null  float64
 5   total_credit                      317177 non-null  float64
 6   credit_utilization_rate           317177 non-null  float64
 7   total_credit_line_count           317177 non-null  float64
 8   issued_month                      317177 non-null  int64  
 9   issued_year                       317177 non-null  int64  
 10  credit_earliest_month             317177 non-null  int64  
 11  credit_earliest_year              317177 non-null  int64  
 12  log_open_credit_line              317177 non-null  float64
 13  YE_categories_G1                  317177 non-null  uint8  
 14  YE_categories_G2                  317177 non-null  uint8  
 15  YE_categories_G3                  317177 non-null  uint8  
 16  home_ownership_MORTGAGE           317177 non-null  uint8  
 17  home_ownership_OTHER              317177 non-null  uint8  
 18  home_ownership_OWN                317177 non-null  uint8  
 19  home_ownership_RENT               317177 non-null  uint8  
 20  loan_grade_A                      317177 non-null  uint8  
 21  loan_grade_B                      317177 non-null  uint8  
 22  loan_grade_C                      317177 non-null  uint8  
 23  loan_grade_D                      317177 non-null  uint8  
 24  loan_grade_E                      317177 non-null  uint8  
 25  loan_grade_F                      317177 non-null  uint8  
 26  loan_grade_G                      317177 non-null  uint8  
 27  loan_grade_Other                  317177 non-null  uint8  
 28  income_verification_not verified  317177 non-null  uint8  
 29  income_verification_verified      317177 non-null  uint8  
 30  initial_status_f                  317177 non-null  uint8  
 31  initial_status_w                  317177 non-null  uint8  
 32  loan_category_car                 317177 non-null  uint8  
 33  loan_category_credit_card         317177 non-null  uint8  
 34  loan_category_debt_consolidation  317177 non-null  uint8  
 35  loan_category_educational         317177 non-null  uint8  
 36  loan_category_home_improvement    317177 non-null  uint8  
 37  loan_category_house               317177 non-null  uint8  
 38  loan_category_major_purchase      317177 non-null  uint8  
 39  loan_category_medical             317177 non-null  uint8  
 40  loan_category_moving              317177 non-null  uint8  
 41  loan_category_other               317177 non-null  uint8  
 42  loan_category_renewable_energy    317177 non-null  uint8  
 43  loan_category_small_business      317177 non-null  uint8  
 44  loan_category_vacation            317177 non-null  uint8  
 45  loan_category_wedding             317177 non-null  uint8  
 46  is_36months_0                     317177 non-null  uint8  
 47  income_over_100000_0              317177 non-null  uint8  
 48  income_over_100000_1              317177 non-null  uint8  
 49  region_Midwest                    317177 non-null  uint8  
 50  region_Northeast                  317177 non-null  uint8  
 51  region_Other                      317177 non-null  uint8  
 52  region_South                      317177 non-null  uint8  
 53  region_West                       317177 non-null  uint8  
 54  has_delinquency_0                 317177 non-null  uint8  
 55  has_delinquency_1                 317177 non-null  uint8  
 56  has_inquiry_0                     317177 non-null  uint8  
 57  has_inquiry_1                     317177 non-null  uint8  
 58  derogatory_public_record_0        317177 non-null  uint8  
 59  derogatory_public_record_1        317177 non-null  uint8  
dtypes: float64(9), int64(4), uint8(47)
memory usage: 48.1 MB
In [210]:
loan_data_df = loan_data_df.drop(columns = ['credit_earliest_month', 'credit_earliest_year'])
In [211]:
loan_data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 317177 entries, 0 to 399999
Data columns (total 58 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   interest_rate                     317177 non-null  float64
 1   loan_requested                    317177 non-null  float64
 2   annual_income                     317177 non-null  float64
 3   debt_payment_ratio                317177 non-null  float64
 4   derogatory_public_record_count    317177 non-null  float64
 5   total_credit                      317177 non-null  float64
 6   credit_utilization_rate           317177 non-null  float64
 7   total_credit_line_count           317177 non-null  float64
 8   issued_month                      317177 non-null  int64  
 9   issued_year                       317177 non-null  int64  
 10  log_open_credit_line              317177 non-null  float64
 11  YE_categories_G1                  317177 non-null  uint8  
 12  YE_categories_G2                  317177 non-null  uint8  
 13  YE_categories_G3                  317177 non-null  uint8  
 14  home_ownership_MORTGAGE           317177 non-null  uint8  
 15  home_ownership_OTHER              317177 non-null  uint8  
 16  home_ownership_OWN                317177 non-null  uint8  
 17  home_ownership_RENT               317177 non-null  uint8  
 18  loan_grade_A                      317177 non-null  uint8  
 19  loan_grade_B                      317177 non-null  uint8  
 20  loan_grade_C                      317177 non-null  uint8  
 21  loan_grade_D                      317177 non-null  uint8  
 22  loan_grade_E                      317177 non-null  uint8  
 23  loan_grade_F                      317177 non-null  uint8  
 24  loan_grade_G                      317177 non-null  uint8  
 25  loan_grade_Other                  317177 non-null  uint8  
 26  income_verification_not verified  317177 non-null  uint8  
 27  income_verification_verified      317177 non-null  uint8  
 28  initial_status_f                  317177 non-null  uint8  
 29  initial_status_w                  317177 non-null  uint8  
 30  loan_category_car                 317177 non-null  uint8  
 31  loan_category_credit_card         317177 non-null  uint8  
 32  loan_category_debt_consolidation  317177 non-null  uint8  
 33  loan_category_educational         317177 non-null  uint8  
 34  loan_category_home_improvement    317177 non-null  uint8  
 35  loan_category_house               317177 non-null  uint8  
 36  loan_category_major_purchase      317177 non-null  uint8  
 37  loan_category_medical             317177 non-null  uint8  
 38  loan_category_moving              317177 non-null  uint8  
 39  loan_category_other               317177 non-null  uint8  
 40  loan_category_renewable_energy    317177 non-null  uint8  
 41  loan_category_small_business      317177 non-null  uint8  
 42  loan_category_vacation            317177 non-null  uint8  
 43  loan_category_wedding             317177 non-null  uint8  
 44  is_36months_0                     317177 non-null  uint8  
 45  income_over_100000_0              317177 non-null  uint8  
 46  income_over_100000_1              317177 non-null  uint8  
 47  region_Midwest                    317177 non-null  uint8  
 48  region_Northeast                  317177 non-null  uint8  
 49  region_Other                      317177 non-null  uint8  
 50  region_South                      317177 non-null  uint8  
 51  region_West                       317177 non-null  uint8  
 52  has_delinquency_0                 317177 non-null  uint8  
 53  has_delinquency_1                 317177 non-null  uint8  
 54  has_inquiry_0                     317177 non-null  uint8  
 55  has_inquiry_1                     317177 non-null  uint8  
 56  derogatory_public_record_0        317177 non-null  uint8  
 57  derogatory_public_record_1        317177 non-null  uint8  
dtypes: float64(9), int64(2), uint8(47)
memory usage: 43.3 MB

3. Modeling¶

Split the data into X (features) and y (label), which is the interest rate.¶
In [212]:
X, y = loan_data_df.drop("interest_rate", axis=1), loan_data_df["interest_rate"]
Splitting X,y into training and testing set.¶
In [213]:
# Set seed for reproducibility
SEED = 1
# Split dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = \
train_test_split(X, y,
test_size=0.2,
random_state=SEED)
Feature importance¶
  • Feature selection by ExtraTreesRegressor(model based). ExtraTreesRegressor helps us find the features which are most important.
In [214]:
# Feature selection by ExtraTreesRegressor(model based)

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as acc
In [215]:
reg= ExtraTreesRegressor()
reg.fit(X_train,y_train)
Out[215]:
ExtraTreesRegressor()
In [216]:
feat_importances = pd.Series(reg.feature_importances_, index=X_train.columns)
feat_importances = feat_importances.sort_values()
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
3.1 Linear Regression¶
In [217]:
lr_model = LinearRegression()

lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)
In [218]:
lr_model.score(X_train,y_train)
Out[218]:
0.8019623655764111
In [219]:
lr_model.score(X_test,y_test)
Out[219]:
0.8010978701749751
In [220]:
print(f"intercept: {lr_model.intercept_}")

print(f"slope: {lr_model.coef_}")
intercept: -339.7272909588108
slope: [ 3.52833426e-05 -1.11312339e-06  1.28592096e-02  3.50818215e-02
 -2.32571264e-05  2.37428907e-02 -1.56538666e-02  1.31611364e-02
  1.76079131e-01  9.07423987e-01 -6.05085142e-02  1.59206121e-02
  4.45879021e-02 -1.34641621e-01 -2.39709113e-02  8.41243880e-02
  7.44881440e-02 -8.28557352e+00 -4.78924590e+00 -1.99983212e+00
  6.80154087e-01  3.39341101e+00  6.16166995e+00  7.52174087e+00
 -2.68232437e+00 -2.00093985e-01  2.00093985e-01  1.24880409e-01
 -1.24880409e-01 -3.46867066e-01 -6.74111792e-01 -3.24679936e-01
 -3.35401868e-01 -2.29655891e-01  1.58840230e-01 -2.89338518e-01
  3.79415031e-01  4.49756051e-01  4.20935081e-01  9.73173441e-02
  2.14753689e-01  3.78614840e-01  1.00422804e-01 -6.66133815e-16
 -9.33475291e-02  9.33475291e-02 -6.01882516e-03 -9.22060256e-03
  2.06086534e-02  8.03336260e-03 -1.34025882e-02 -1.50029532e-01
  1.50029532e-01 -3.06511626e-01  3.06511626e-01 -7.85082667e-02
  7.85082667e-02]
In [221]:
# Compute mse_dt
mse_lr = MSE(y_test, y_pred_lr)

# Compute rmse_dt
rmse_lr = mse_lr**(1/2)
In [222]:
from sklearn.model_selection import cross_val_score
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(lr_model, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error', 
                                  n_jobs=-1) 

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
In [223]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
lr_model.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = lr_model.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)
In [224]:
# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_lr))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of dt: 1.95
CV RMSE: 1.94
Train RMSE: 1.94

https://realpython.com/linear-regression-in-python/

3.2.1 Decision Tree¶
In [225]:
# Instantiate dt
dt = DecisionTreeRegressor(max_depth=8,
                           min_samples_leaf=0.13,
                           random_state=3)

# Fit dt to the training set
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

# Compute mse_dt
mse_dt = MSE(y_test, y_pred)

# Compute rmse_dt
rmse_dt = mse_dt**(1/2)
In [226]:
dt.score(X_train,y_train)
Out[226]:
0.7186956687942803
In [227]:
dt.score(X_test,y_test)
Out[227]:
0.7189822495817428
In [228]:
from sklearn.model_selection import cross_val_score
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error', 
                                  n_jobs=-1) 

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
In [229]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)
In [230]:
# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of dt: 2.32
CV RMSE: 2.31
Train RMSE: 2.31
In [231]:
# reg_decision_model=DecisionTreeRegressor()
# # fit independent varaibles to the dependent variables
# reg_decision_model.fit(X_train,y_train)
3.2.2 Decision Treee hyperparameter tuning model¶
  • hyperparameters tuning
In [232]:
parameters={"splitter":["best","random"],
            "max_depth" : [1,3,5,7,9,11,12],
           "min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
           "min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
           "max_features":["auto","log2","sqrt",None],
           "max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90] }

https://www.nbshare.io/notebook/312837011/Decision-Tree-Regression-With-Hyper-Parameter-Tuning-In-Python/

In [233]:
from sklearn.model_selection import GridSearchCV
tuning_model=GridSearchCV(dt,
                          param_grid=parameters,
                          scoring='neg_mean_squared_error',
                          cv=3,verbose=3)
In [234]:
# # Hyper parameters range intialization for tuning 
tuning_model.get_params
Out[234]:
<bound method BaseEstimator.get_params of GridSearchCV(cv=3,
             estimator=DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13,
                                             random_state=3),
             param_grid={'max_depth': [1, 3, 5, 7, 9, 11, 12],
                         'max_features': ['auto', 'log2', 'sqrt', None],
                         'max_leaf_nodes': [None, 10, 20, 30, 40, 50, 60, 70,
                                            80, 90],
                         'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'min_weight_fraction_leaf': [0.1, 0.2, 0.3, 0.4, 0.5,
                                                      0.6, 0.7, 0.8, 0.9],
                         'splitter': ['best', 'random']},
             scoring='neg_mean_squared_error', verbose=3)>
In [235]:
# dir(GridSearchCV)
In [236]:
tuned_hyper_model= DecisionTreeRegressor(max_depth= 10,
                                         max_features='auto',
                                         max_leaf_nodes=50,
                                         min_samples_leaf=2,
                                         min_weight_fraction_leaf=0.05,
                                         splitter='random')
tuned_hyper_model.fit(X_train,y_train)
Out[236]:
DecisionTreeRegressor(max_depth=10, max_features='auto', max_leaf_nodes=50,
                      min_samples_leaf=2, min_weight_fraction_leaf=0.05,
                      splitter='random')
In [237]:
# prediction 
tuned_pred=tuned_hyper_model.predict(X_test)
In [238]:
tuned_hyper_model.score(X_train,y_train)
Out[238]:
0.7378913322230546
In [239]:
tuned_hyper_model.score(X_test,y_test)
Out[239]:
0.736550612644119
In [240]:
# Compute mse_dt
mse_ht = MSE(y_test, tuned_pred)

# Compute rmse_dt
rmse_ht = mse_dt**(1/2)

from sklearn.model_selection import cross_val_score
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(tuned_hyper_model, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error', 
                                  n_jobs=-1) 

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)


# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
tuned_pred_train = tuned_hyper_model.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, tuned_pred_train))**(1/2)
In [241]:
# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_ht))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of dt: 2.32
CV RMSE: 2.23
Train RMSE: 2.23
In [242]:
# # With hyperparameter tuned 
# print('MAE:', metrics.mean_absolute_error(y_test,tuned_pred))
# print('MSE:', metrics.mean_squared_error(y_test, tuned_pred))
# print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, tuned_pred)))
3.3.1 Random Forest¶
In [243]:
rf = RandomForestRegressor(random_state = 42)
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())
Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
In [244]:
# Fit 'rf' to the training set
rf.fit(X_train, y_train)

# predict the train set labels
y_pred_train=rf.predict(X_train)


# Predict the test set labels 'y_pred_test'
y_pred_test = rf.predict(X_test)


# Evaluate the train set RMSE
rmse_train = MSE(y_train, y_pred_train)**(1/2)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(rf, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error', 
                                  n_jobs=-1) 

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
In [245]:
# Print rmse_dt
print("Test set RMSE of rf: {:.2f}".format(rmse_test))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of rf: 2.32
CV RMSE: 1.58
Train RMSE: 2.23
In [246]:
rf.score(X_train,y_train)
Out[246]:
0.9815236633541242
In [247]:
rf.score(X_train,y_train)
Out[247]:
0.9815236633541242
In [248]:
# Parameters currently in use:

# {'bootstrap': True,
#  'criterion': 'mse',
#  'max_depth': None,
#  'max_features': 'auto',
#  'max_leaf_nodes': None,
#  'min_impurity_decrease': 0.0,
#  'min_impurity_split': None,
#  'min_samples_leaf': 1,
#  'min_samples_split': 2,
#  'min_weight_fraction_leaf': 0.0,
#  'n_estimators': 10,
#  'n_jobs': 1,
#  'oob_score': False,
#  'random_state': 42,
#  'verbose': 0,
#  'warm_start': False}
3.3.2 Random Forest - hyperparameter tuning model¶
  • the most important settings are the number of trees in the forest (n_estimators) and the number of features considered for splitting at each leaf node (max_features)
  • n_estimators = number of trees in the foreset
  • max_features = max number of features considered for splitting a node
  • max_depth = max number of levels in each decision tree
  • min_samples_split = min number of data points placed in a node before the node is split
  • min_samples_leaf = min number of data points allowed in a leaf node
  • bootstrap = method for sampling data points (with or without replacement)
In [249]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
In [250]:
print(random_grid)
{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
In [251]:
# Instantiate a random forests regressor 'rf' 400 estimators
SEED = 1
rf = RandomForestRegressor(n_estimators=400,
min_samples_leaf=0.12,
random_state=SEED)

# Fit 'rf' to the training set
rf.fit(X_train, y_train)

# predict the train set labels
y_pred_train=rf.predict(X_train)


# Predict the test set labels 'y_pred_test'
y_pred_test = rf.predict(X_test)


# Evaluate the train set RMSE
rmse_train = MSE(y_train, y_pred_train)**(1/2)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(rf, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error', 
                                  n_jobs=-1) 

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
In [252]:
# prediction 
rf.score(X_train,y_train)
Out[252]:
0.22021748614548964
In [253]:
rf.score(X_test,y_test)
Out[253]:
0.21763652870954786
In [254]:
# Print rmse_dt
print("Test set RMSE of rf: {:.2f}".format(rmse_test))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of rf: 2.32
CV RMSE: 3.85
Train RMSE: 2.23

https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

3.4.1 Gradient Boosting¶
In [255]:
# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4,
                               n_estimators=200,
                               random_state=2)

# Fit gb to the training set
gb.fit(X_train, y_train)

# predict the train set labels
y_pred_gb_train=gb.predict(X_train)

# Predict test set labels
y_pred_gb_test = gb.predict(X_test)

# Evaluate the train set RMSE
rmse_train = MSE(y_train, y_pred_gb_train)**(1/2)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred_gb_test)**(1/2)

# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(gb, X_train, y_train, cv=10, 
                                  scoring='neg_mean_squared_error', 
                                  n_jobs=-1) 

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
In [256]:
# Print rmse_dt
print("Test set RMSE of rf: {:.2f}".format(rmse_test))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of rf: 1.58
CV RMSE: 1.57
Train RMSE: 2.23
In [257]:
# predict on 
gb.score(X_train,y_train)
Out[257]:
0.8730305534626144
In [258]:
gb.score(X_test,y_test)
Out[258]:
0.8689728770713843
3.4.2 Gradient Boosting hyperparameter tuning model¶
In [ ]:
 

Summary¶

  • I applied linear regression, decision tree, random forest and gradient boosting and then fine tune the hyperparameter to address overfitting/underfitting problem and to acheive better results.

  • Linear regression: is used when the output variable is continuous variable. Here, I used it as a baseline model to compare with other models. The score(R square) for traning and testing results are about the same(80%) and the test(1.95), CV(1.94) and train(1.94) rmse are around the same. This model performs well and has no obvious overfitting or underfitting problem.

  • Decision tree: I used max_depth = 8, min_samples_leaf = 0.13, random_state=3 to instantiate decision tree. The R square for both training and testing data is 72%, which is lower than using linear regression. Also, the test(2.32), CV(2.31), and train(2.31) RMSE are around the same. Therefore, there is no obvious overfitting and underfitting problem. I then tuned the model with max_depth = 10, max_leaf_nodes = 50, min_sample_leaf = 2 to see if it improves the performance. The R-square for both training and testing is 74%. The train(2.22) and CV(2.23) RMSE are around the same yet the test set RMSE is slightly higher(2.32), showing the sign of overfitting. Even though the R square increases 2%, considering it showing the sign of overfitting, the previous DT model is better.

  • Random forest: I used the default random forest function and the R-square for both training and testing are around 98%. The explanatory power of predictors on target variable is high. The test set of rmse(2.32) is slightly higher than the train rmse(2.22) and the CV is lower(1.58). I tuned the hyperparameter and set the n_estimators=400, min_samples_leaf=0.12. R-square becomes very low(22%) and has overfitting problem(CV 3.85 > test 2.32 > train 2.22). The previous random forset with 98% of explanatory power performs better.

  • Gradient boosting: The R-square score of training and testing are 87%, which is also a good model. Nevertheless, the CV rmse(1.57) < test set rmse(1.58) < training rmsa(2.22). This model has slightly underfitting problem.

From above, since linear regression model performs well (r square = 80%) with no overfitting and underfitting problem, I will use it to predict the interest rate. In addition, random forest has 98% of explanatory problem with no serious overfitting problem.I will also use it to predict the interest rate. To improve model performance, will try to do clustering techniques on job title and loan title and include them as predictors in the future.

4. Apply model to the dataset that we want to predict & save the results¶

Apply all the modifications done in the training set to testing set¶
In [259]:
test_data = pd.read_csv("Holdout for Testing.csv")
test_data.head()
Out[259]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32
0 NaN 44409194 47416907 $6,000 $6,000 $6,000 36 months C C5 Electrician 10+ years MORTGAGE 68000.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 430xx OH 28.31 0 2-Nov 1 26.0 NaN 18 0 19861 64.50% 33 f
1 NaN 44017917 47034722 $24,000 $24,000 $24,000 36 months A A1 Executive Assistant 8 years RENT 110480.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 945xx CA 16.03 0 Dec-68 1 NaN NaN 12 0 17001 26.20% 36 w
2 NaN 44259158 47306871 $35,000 $35,000 $35,000 36 months C C2 District Sales Leader 10+ years MORTGAGE 86000.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 195xx PA 32.49 0 Oct-98 0 NaN NaN 16 0 25797 49.90% 33 w
3 NaN 44429213 47476932 $10,000 $10,000 $10,000 60 months D D1 pharmacy associate 10+ years RENT 30000.0 VERIFIED - income source 15-Mar NaN debt_consolidation Debt consolidation 921xx CA 32.96 0 Feb-99 1 NaN 114.0 13 1 9586 43.80% 21 w
4 NaN 44299188 47346901 $24,000 $24,000 $24,000 60 months B B1 Medical case manager 10+ years MORTGAGE 82500.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 196xx PA 31.03 0 2-Dec 0 48.0 NaN 27 0 31842 41.30% 43 w
In [260]:
test_data.shape
Out[260]:
(80000, 32)
In [261]:
test_data = test_data.drop("X1", axis=1)
In [262]:
test_data.columns = [#'interest_rate',
                    'loan_id',
                    'borrower_id',
                    'loan_requested',
                    'loan_funded',
                    'investor_funded',
                    'payment_numbers',
                    'loan_grade',
                    'loan_subgrade',
                    'job_title',
                    'years_employed',
                    'home_ownership',
                    'annual_income',
                    'income_verification',
                    'issued_date',
                    'loan_reason',
                    'loan_category',
                    'loan_title',
                    'zip_code',
                    'state',
                    'debt_payment_ratio',
                    'delinquency_frequency',
                    'credit_earliest_date',
                    'creditor_inquiries',
                    'borrower_delinquency_recency',
                    'public_record_recency',
                    'open_credit_line_count',
                    'derogatory_public_record_count',
                    'total_credit',
                    'credit_utilization_rate',
                    'total_credit_line_count',
                    'initial_status'
                    ]
In [263]:
test_data.head()
Out[263]:
loan_id borrower_id loan_requested loan_funded investor_funded payment_numbers loan_grade loan_subgrade job_title years_employed home_ownership annual_income income_verification issued_date loan_reason loan_category loan_title zip_code state debt_payment_ratio delinquency_frequency credit_earliest_date creditor_inquiries borrower_delinquency_recency public_record_recency open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status
0 44409194 47416907 $6,000 $6,000 $6,000 36 months C C5 Electrician 10+ years MORTGAGE 68000.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 430xx OH 28.31 0 2-Nov 1 26.0 NaN 18 0 19861 64.50% 33 f
1 44017917 47034722 $24,000 $24,000 $24,000 36 months A A1 Executive Assistant 8 years RENT 110480.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 945xx CA 16.03 0 Dec-68 1 NaN NaN 12 0 17001 26.20% 36 w
2 44259158 47306871 $35,000 $35,000 $35,000 36 months C C2 District Sales Leader 10+ years MORTGAGE 86000.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 195xx PA 32.49 0 Oct-98 0 NaN NaN 16 0 25797 49.90% 33 w
3 44429213 47476932 $10,000 $10,000 $10,000 60 months D D1 pharmacy associate 10+ years RENT 30000.0 VERIFIED - income source 15-Mar NaN debt_consolidation Debt consolidation 921xx CA 32.96 0 Feb-99 1 NaN 114.0 13 1 9586 43.80% 21 w
4 44299188 47346901 $24,000 $24,000 $24,000 60 months B B1 Medical case manager 10+ years MORTGAGE 82500.0 VERIFIED - income 15-Mar NaN debt_consolidation Debt consolidation 196xx PA 31.03 0 2-Dec 0 48.0 NaN 27 0 31842 41.30% 43 w
In [264]:
test_data.isna().sum()
Out[264]:
loan_id                               0
borrower_id                           0
loan_requested                        0
loan_funded                           0
investor_funded                       0
payment_numbers                       0
loan_grade                            0
loan_subgrade                         0
job_title                          4394
years_employed                     4382
home_ownership                        0
annual_income                         0
income_verification                   0
issued_date                           0
loan_reason                       79985
loan_category                         0
loan_title                            0
zip_code                              0
state                                 0
debt_payment_ratio                    0
delinquency_frequency                 0
credit_earliest_date                  0
creditor_inquiries                    0
borrower_delinquency_recency      38704
public_record_recency             66161
open_credit_line_count                0
derogatory_public_record_count        0
total_credit                          0
credit_utilization_rate              30
total_credit_line_count               0
initial_status                        0
dtype: int64
In [265]:
# drop na in target variable
#test_data = test_data.dropna(subset=['interest_rate'])

# drop variables that are not predictive to target variable
test_data = test_data.drop(columns = ['loan_reason', 'borrower_delinquency_recency', 'public_record_recency'])
test_data = test_data.drop(columns = ['loan_id', 'borrower_id', 'loan_subgrade', 'zip_code'])
In [266]:
test_data.isna().sum()
Out[266]:
loan_requested                       0
loan_funded                          0
investor_funded                      0
payment_numbers                      0
loan_grade                           0
job_title                         4394
years_employed                    4382
home_ownership                       0
annual_income                        0
income_verification                  0
issued_date                          0
loan_category                        0
loan_title                           0
state                                0
debt_payment_ratio                   0
delinquency_frequency                0
credit_earliest_date                 0
creditor_inquiries                   0
open_credit_line_count               0
derogatory_public_record_count       0
total_credit                         0
credit_utilization_rate             30
total_credit_line_count              0
initial_status                       0
dtype: int64
In [267]:
# issued_date
test_data['issued_month'] = test_data['issued_date'].apply(lambda x: str(x)[-3:])
test_data['issued_year'] = test_data['issued_date'].apply(lambda x: str(x)[:2])
In [268]:
# change issued_date & credit_earliest_date's Month & Year

test_data['issued_month'] = test_data['issued_month'].apply(lambda x: mtn(x))
test_data['issued_year'] = test_data['issued_year'].apply(lambda x: year(x))
In [269]:
# test_data['credit_earliest_date'].unique()
In [270]:
test_data.head()
Out[270]:
loan_requested loan_funded investor_funded payment_numbers loan_grade job_title years_employed home_ownership annual_income income_verification issued_date loan_category loan_title state debt_payment_ratio delinquency_frequency credit_earliest_date creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year
0 $6,000 $6,000 $6,000 36 months C Electrician 10+ years MORTGAGE 68000.0 VERIFIED - income 15-Mar debt_consolidation Debt consolidation OH 28.31 0 2-Nov 1 18 0 19861 64.50% 33 f 3 2015
1 $24,000 $24,000 $24,000 36 months A Executive Assistant 8 years RENT 110480.0 VERIFIED - income 15-Mar debt_consolidation Debt consolidation CA 16.03 0 Dec-68 1 12 0 17001 26.20% 36 w 3 2015
2 $35,000 $35,000 $35,000 36 months C District Sales Leader 10+ years MORTGAGE 86000.0 VERIFIED - income 15-Mar debt_consolidation Debt consolidation PA 32.49 0 Oct-98 0 16 0 25797 49.90% 33 w 3 2015
3 $10,000 $10,000 $10,000 60 months D pharmacy associate 10+ years RENT 30000.0 VERIFIED - income source 15-Mar debt_consolidation Debt consolidation CA 32.96 0 Feb-99 1 13 1 9586 43.80% 21 w 3 2015
4 $24,000 $24,000 $24,000 60 months B Medical case manager 10+ years MORTGAGE 82500.0 VERIFIED - income 15-Mar debt_consolidation Debt consolidation PA 31.03 0 2-Dec 0 27 0 31842 41.30% 43 w 3 2015
In [271]:
test_data = test_data.drop(['issued_date', 'credit_earliest_date' ], axis=1)
In [273]:
test_data.shape
Out[273]:
(80000, 24)
In [275]:
for col in ['loan_requested', 'loan_funded', 'investor_funded']:
    test_data[col] = test_data[col].str.slice(start=1).str.replace(",","").astype("float")
    
for col in ['credit_utilization_rate']:
    test_data[col] = test_data[col].str.slice(stop=-1).astype("float")
    
test_data['loan_requested'].fillna(test_data['loan_requested'].median(), inplace=True)
test_data['loan_funded'].fillna(test_data['loan_funded'].median(), inplace=True)
test_data['investor_funded'].fillna(test_data['investor_funded'].median(), inplace=True)
In [276]:
test_data['is_36months'] = test_data['payment_numbers'].apply(lambda x: 1 if x == '36 months' else 0)
test_data = test_data.drop('payment_numbers', axis=1)
test_data.loan_grade = test_data.loan_grade.fillna('Other')

test_data.job_title = test_data.job_title.fillna('Other')
test_data = test_data.drop('job_title', axis=1)


test_data['years_employed'] = test_data['years_employed'].str.replace("< 1", "0").replace("+", "")
test_data['years_employed'] = test_data['years_employed'].str.strip()
test_data['years_employed'] = test_data['years_employed'].str.slice(stop=2)
test_data['years_employed'] = test_data['years_employed'].astype('float')
test_data['years_employed'] = test_data['years_employed'].interpolate()


year_condition = [(test_data['years_employed'] >= 0) & (test_data['years_employed'] < 6), 
                  (test_data['years_employed'] >= 6) & (test_data['years_employed'] < 10),
                 test_data['years_employed'] >= 10]
In [277]:
test_data.shape
Out[277]:
(80000, 23)
In [278]:
year_em_categories = ['G1', 'G2', 'G3']
test_data['YE_categories'] = np.select(year_condition, year_em_categories)

test_data["YE_categories"] = test_data["YE_categories"].astype("category")
test_data = test_data.drop('years_employed', axis=1)


test_data["home_ownership"] = test_data["home_ownership"].str.replace("NONE", "OTHER").str.replace("ANY", "OTHER")
test_data["home_ownership"] = test_data["home_ownership"] .fillna("OTHER")
In [279]:
test_data.shape
Out[279]:
(80000, 23)
In [280]:
test_data.annual_income = test_data.annual_income.interpolate()

income_condition = [(test_data['annual_income'] >= 0) & (test_data['annual_income'] < 100000), 
                  (test_data['annual_income'] >= 100000) & (test_data['annual_income'] < 200000),
                  (test_data['annual_income'] >= 200000) & (test_data['annual_income'] < 300000),
                  (test_data['annual_income'] >= 300000) & (test_data['annual_income'] < 400000),
                  (test_data['annual_income'] >= 400000) & (test_data['annual_income'] < 500000),
                  (test_data['annual_income'] >= 500000) & (test_data['annual_income'] < 600000),
                  (test_data['annual_income'] >= 600000) & (test_data['annual_income'] < 700000),
                 test_data['annual_income'] >= 700000]
income_categories = ['G1', 'G2', 'G3', 'G4', 'G5', 'G6', 'G7', 'G8']
test_data['income_categories'] = np.select(income_condition, income_categories)

test_data['income_over_100000'] = test_data['income_categories'].apply(lambda x:1 if x =='G1' else 0)
# drop income categories
test_data = test_data.drop(columns = ['income_categories'])
In [281]:
test_data.shape
Out[281]:
(80000, 24)
In [282]:
test_data["income_verification"] = test_data["income_verification"].str.replace("VERIFIED - income", "verified").str.replace("verified source","verified")
test_data["income_verification"]  = test_data["income_verification"] .fillna("verified")


test_data = test_data.drop(columns = ['loan_title'])
In [283]:
test_data.shape
Out[283]:
(80000, 23)
In [284]:
states = {
    "AK": "Other",
    "AL": "South",
    "AR": "South",
    "AS": "Other",
    "AZ": "West",
    "CA": "West",
    "CO": "West",
    "CT": "Northeast",
    "DC": "Northeast",
    "DE": "Northeast",
    "FL": "South",
    "GA": "South",
    "GU": "Other",
    "HI": "Other",
    "IA": "Midwest",
    "ID": "West",
    "IL": "Midwest",
    "IN": "Midwest",
    "KS": "Midwest",
    "KY": "South",
    "LA": "South",
    "MA": "Northeast",
    "MD": "Northeast",
    "ME": "Northeast",
    "MI": "West",
    "MN": "Midwest",
    "MO": "Midwest",
    "MP": "Other",
    "MS": "South",
    "MT": "West",
    "NA": "Other",
    "NC": "South",
    "ND": "Midwest",
    "NE": "West",
    "NH": "Northeast",
    "NJ": "Northeast",
    "NM": "West",
    "NV": "West",
    "NY": "Northeast",
    "OH": "Midwest",
    "OK": "South",
    "OR": "West",
    "PA": "Northeast",
    "PR": "Other",
    "RI": "Northeast",
    "SC": "South",
    "SD": "Midwest",
    "TN": "South",
    "TX": "South",
    "UT": "West",
    "VA": "South",
    "VI": "Other",
    "VT": "Northeast",
    "WA": "West",
    "WI": "Midwest",
    "WV": "South",
    "WY": "West",
}

test_data["region"] = test_data["state"].map(states)

test_data = test_data.drop("state", axis=1)
In [285]:
test_data.shape
Out[285]:
(80000, 23)
In [286]:
test_data.head()
Out[286]:
loan_requested loan_funded investor_funded loan_grade home_ownership annual_income income_verification loan_category debt_payment_ratio delinquency_frequency creditor_inquiries open_credit_line_count derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year is_36months YE_categories income_over_100000 region
0 6000.0 6000.0 6000.0 C MORTGAGE 68000.0 verified debt_consolidation 28.31 0 1 18 0 19861 64.5 33 f 3 2015 0 G3 1 Midwest
1 24000.0 24000.0 24000.0 A RENT 110480.0 verified debt_consolidation 16.03 0 1 12 0 17001 26.2 36 w 3 2015 0 G2 0 West
2 35000.0 35000.0 35000.0 C MORTGAGE 86000.0 verified debt_consolidation 32.49 0 0 16 0 25797 49.9 33 w 3 2015 0 G3 1 Northeast
3 10000.0 10000.0 10000.0 D RENT 30000.0 verified debt_consolidation 32.96 0 1 13 1 9586 43.8 21 w 3 2015 0 G3 1 West
4 24000.0 24000.0 24000.0 B MORTGAGE 82500.0 verified debt_consolidation 31.03 0 0 27 0 31842 41.3 43 w 3 2015 0 G3 1 Northeast
In [287]:
test_data['has_inquiry'] = test_data['creditor_inquiries'].apply(lambda x: 1 if x != 0 else 0)
In [288]:
test_data.shape
Out[288]:
(80000, 24)
In [289]:
test_data['log_open_credit_line'] = np.log10(test_data['open_credit_line_count'])
test_data['log_open_credit_line'].replace([-np.inf], 0, inplace = True)
test_data = test_data.drop(['open_credit_line_count'], axis=1)
In [290]:
test_data.shape
Out[290]:
(80000, 24)
In [291]:
test_data['derogatory_public_record'] = test_data['derogatory_public_record_count'].apply(lambda x: 1 if x != 0 else 0)

#test_data = test_data[test_data['total_credit'] <= 40889]
test_data['credit_utilization_rate'] = test_data['credit_utilization_rate'].interpolate()

#test_data = test_data[test_data['credit_utilization_rate'] <= 127.4]

#test_data = test_data[test_data['total_credit_line_count'] <= 53]
In [292]:
test_data.shape
Out[292]:
(80000, 25)
In [293]:
test_data['has_delinquency'] = test_data['delinquency_frequency'].apply(lambda x: 1 if x != 0 else 0)
test_data = test_data.drop(columns = ['delinquency_frequency'])
In [294]:
test_data.isna().sum()
Out[294]:
loan_requested                    0
loan_funded                       0
investor_funded                   0
loan_grade                        0
home_ownership                    0
annual_income                     0
income_verification               0
loan_category                     0
debt_payment_ratio                0
creditor_inquiries                0
derogatory_public_record_count    0
total_credit                      0
credit_utilization_rate           0
total_credit_line_count           0
initial_status                    0
issued_month                      0
issued_year                       0
is_36months                       0
YE_categories                     0
income_over_100000                0
region                            0
has_inquiry                       0
log_open_credit_line              0
derogatory_public_record          0
has_delinquency                   0
dtype: int64
In [295]:
test_data.shape
Out[295]:
(80000, 25)
In [296]:
loan_data.shape
Out[296]:
(317177, 27)
In [297]:
test_data.head()
Out[297]:
loan_requested loan_funded investor_funded loan_grade home_ownership annual_income income_verification loan_category debt_payment_ratio creditor_inquiries derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year is_36months YE_categories income_over_100000 region has_inquiry log_open_credit_line derogatory_public_record has_delinquency
0 6000.0 6000.0 6000.0 C MORTGAGE 68000.0 verified debt_consolidation 28.31 1 0 19861 64.5 33 f 3 2015 0 G3 1 Midwest 1 1.255273 0 0
1 24000.0 24000.0 24000.0 A RENT 110480.0 verified debt_consolidation 16.03 1 0 17001 26.2 36 w 3 2015 0 G2 0 West 1 1.079181 0 0
2 35000.0 35000.0 35000.0 C MORTGAGE 86000.0 verified debt_consolidation 32.49 0 0 25797 49.9 33 w 3 2015 0 G3 1 Northeast 0 1.204120 0 0
3 10000.0 10000.0 10000.0 D RENT 30000.0 verified debt_consolidation 32.96 1 1 9586 43.8 21 w 3 2015 0 G3 1 West 1 1.113943 1 0
4 24000.0 24000.0 24000.0 B MORTGAGE 82500.0 verified debt_consolidation 31.03 0 0 31842 41.3 43 w 3 2015 0 G3 1 Northeast 0 1.431364 0 0
In [298]:
test_data_df = test_data.drop(columns = ['loan_funded', 'investor_funded'])
In [299]:
test_data_df.shape
Out[299]:
(80000, 23)
In [300]:
loan_data_df.head()
Out[300]:
interest_rate loan_requested annual_income debt_payment_ratio derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count issued_month issued_year log_open_credit_line YE_categories_G1 YE_categories_G2 YE_categories_G3 home_ownership_MORTGAGE home_ownership_OTHER home_ownership_OWN home_ownership_RENT loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G loan_grade_Other income_verification_not verified income_verification_verified initial_status_f initial_status_w loan_category_car loan_category_credit_card loan_category_debt_consolidation loan_category_educational loan_category_home_improvement loan_category_house loan_category_major_purchase loan_category_medical loan_category_moving loan_category_other loan_category_renewable_energy loan_category_small_business loan_category_vacation loan_category_wedding is_36months_0 income_over_100000_0 income_over_100000_1 region_Midwest region_Northeast region_Other region_South region_West has_delinquency_0 has_delinquency_1 has_inquiry_0 has_inquiry_1 derogatory_public_record_0 derogatory_public_record_1
0 11.89 25000.0 85000.0 19.48 0.0 28854.0 52.1 42.0 8 2009 1.000000 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0
1 10.71 7000.0 65000.0 14.29 0.0 33623.0 76.7 7.0 5 2008 0.845098 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0
2 16.99 25000.0 70000.0 10.50 0.0 19878.0 66.3 17.0 8 2014 1.000000 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0
3 13.11 1200.0 54000.0 5.47 0.0 2584.0 40.4 31.0 3 2010 0.698970 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0
4 13.57 10800.0 32000.0 11.63 0.0 3511.0 25.6 40.0 11 2009 1.146128 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1 1 0
In [301]:
loan_data_df.home_ownership_OTHER.value_counts()
Out[301]:
0    268431
1     48746
Name: home_ownership_OTHER, dtype: int64
In [302]:
loan_data_df.loan_category_educational.value_counts()
Out[302]:
0    316916
1       261
Name: loan_category_educational, dtype: int64
In [303]:
316916/(316916+261)
Out[303]:
0.9991771156168322
In [304]:
1-0.9991771156168322
Out[304]:
0.000822884383167799
In [305]:
test_data_df.shape
Out[305]:
(80000, 23)
In [306]:
test_data_df.head()
Out[306]:
loan_requested loan_grade home_ownership annual_income income_verification loan_category debt_payment_ratio creditor_inquiries derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count initial_status issued_month issued_year is_36months YE_categories income_over_100000 region has_inquiry log_open_credit_line derogatory_public_record has_delinquency
0 6000.0 C MORTGAGE 68000.0 verified debt_consolidation 28.31 1 0 19861 64.5 33 f 3 2015 0 G3 1 Midwest 1 1.255273 0 0
1 24000.0 A RENT 110480.0 verified debt_consolidation 16.03 1 0 17001 26.2 36 w 3 2015 0 G2 0 West 1 1.079181 0 0
2 35000.0 C MORTGAGE 86000.0 verified debt_consolidation 32.49 0 0 25797 49.9 33 w 3 2015 0 G3 1 Northeast 0 1.204120 0 0
3 10000.0 D RENT 30000.0 verified debt_consolidation 32.96 1 1 9586 43.8 21 w 3 2015 0 G3 1 West 1 1.113943 1 0
4 24000.0 B MORTGAGE 82500.0 verified debt_consolidation 31.03 0 0 31842 41.3 43 w 3 2015 0 G3 1 Northeast 0 1.431364 0 0
In [307]:
test_data_df = pd.get_dummies(test_data_df, columns = ['YE_categories', 
                                                       'home_ownership',
                                                       'loan_grade', 
                                                       'income_verification', 
                                                       'initial_status', 
                                                       'loan_category', 
                                                       #'issued_month', 
                                                       #'issued_year', 
                                                       #'credit_earliest_month', 
                                                       #'credit_earliest_year', 
                                                       'is_36months', 
                                                       'income_over_100000', 
                                                       'region', 
                                                       'has_delinquency', 
                                                       'has_inquiry',
                                                      'derogatory_public_record'])
In [308]:
col = list(test_data_df.columns)
col2 = list(loan_data_df.columns)
In [309]:
for c in loan_data_df.columns:
    if c not in test_data_df.columns:
        print(c)
interest_rate
home_ownership_OTHER
loan_grade_Other
loan_category_educational
In [310]:
# filling in the missing columns and fill in 0, 1 values according to the proportion in training set
test_data_df['home_ownership_OTHER'] = np.random.choice([0, 1], size=len(test_data_df), p=[0.84, 0.16])
test_data_df['loan_grade_Other'] = np.random.choice([0, 1], size=len(test_data_df), p=[0.85, 0.15])
test_data_df['loan_category_educational'] = np.random.choice([0, 1], size=len(test_data_df), p=[0.99, 0.01])
In [311]:
test_data_df.shape
Out[311]:
(80000, 58)
In [312]:
loan_data_df.shape
Out[312]:
(317177, 58)
In [313]:
test_data_df.columns
Out[313]:
Index(['loan_requested', 'annual_income', 'debt_payment_ratio',
       'creditor_inquiries', 'derogatory_public_record_count', 'total_credit',
       'credit_utilization_rate', 'total_credit_line_count', 'issued_month',
       'issued_year', 'log_open_credit_line', 'YE_categories_G1',
       'YE_categories_G2', 'YE_categories_G3', 'home_ownership_MORTGAGE',
       'home_ownership_OWN', 'home_ownership_RENT', 'loan_grade_A',
       'loan_grade_B', 'loan_grade_C', 'loan_grade_D', 'loan_grade_E',
       'loan_grade_F', 'loan_grade_G', 'income_verification_not verified',
       'income_verification_verified', 'initial_status_f', 'initial_status_w',
       'loan_category_car', 'loan_category_credit_card',
       'loan_category_debt_consolidation', 'loan_category_home_improvement',
       'loan_category_house', 'loan_category_major_purchase',
       'loan_category_medical', 'loan_category_moving', 'loan_category_other',
       'loan_category_renewable_energy', 'loan_category_small_business',
       'loan_category_vacation', 'loan_category_wedding', 'is_36months_0',
       'income_over_100000_0', 'income_over_100000_1', 'region_Midwest',
       'region_Northeast', 'region_Other', 'region_South', 'region_West',
       'has_delinquency_0', 'has_delinquency_1', 'has_inquiry_0',
       'has_inquiry_1', 'derogatory_public_record_0',
       'derogatory_public_record_1', 'home_ownership_OTHER',
       'loan_grade_Other', 'loan_category_educational'],
      dtype='object')
In [314]:
loan_data_df.columns
Out[314]:
Index(['interest_rate', 'loan_requested', 'annual_income',
       'debt_payment_ratio', 'derogatory_public_record_count', 'total_credit',
       'credit_utilization_rate', 'total_credit_line_count', 'issued_month',
       'issued_year', 'log_open_credit_line', 'YE_categories_G1',
       'YE_categories_G2', 'YE_categories_G3', 'home_ownership_MORTGAGE',
       'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT',
       'loan_grade_A', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D',
       'loan_grade_E', 'loan_grade_F', 'loan_grade_G', 'loan_grade_Other',
       'income_verification_not verified', 'income_verification_verified',
       'initial_status_f', 'initial_status_w', 'loan_category_car',
       'loan_category_credit_card', 'loan_category_debt_consolidation',
       'loan_category_educational', 'loan_category_home_improvement',
       'loan_category_house', 'loan_category_major_purchase',
       'loan_category_medical', 'loan_category_moving', 'loan_category_other',
       'loan_category_renewable_energy', 'loan_category_small_business',
       'loan_category_vacation', 'loan_category_wedding', 'is_36months_0',
       'income_over_100000_0', 'income_over_100000_1', 'region_Midwest',
       'region_Northeast', 'region_Other', 'region_South', 'region_West',
       'has_delinquency_0', 'has_delinquency_1', 'has_inquiry_0',
       'has_inquiry_1', 'derogatory_public_record_0',
       'derogatory_public_record_1'],
      dtype='object')
In [315]:
col = ['loan_requested', 
       'annual_income',
       'debt_payment_ratio',
       'derogatory_public_record_count',
       'total_credit',
       'credit_utilization_rate',
       'total_credit_line_count',
       'issued_month',
       'issued_year',
       #'credit_earliest_month',
       #'credit_earliest_year',
       'log_open_credit_line', 
       'YE_categories_G1', 
       'YE_categories_G2',
       'YE_categories_G3', 
       'home_ownership_MORTGAGE', 
       'home_ownership_OTHER',
       'home_ownership_OWN', 
       'home_ownership_RENT', 
       'loan_grade_A',
       'loan_grade_B',
       'loan_grade_C',
       'loan_grade_D',
       'loan_grade_E',
       'loan_grade_F',
       'loan_grade_G',
       'loan_grade_Other',
       'income_verification_not verified',
       'income_verification_verified',
       'initial_status_f',
       'initial_status_w',
       'loan_category_car',
       'loan_category_credit_card',
       'loan_category_debt_consolidation',
       'loan_category_educational',
       'loan_category_home_improvement',
       'loan_category_house',
       'loan_category_major_purchase',
       'loan_category_medical',
       'loan_category_moving',
       'loan_category_other',
       'loan_category_renewable_energy',
       'loan_category_small_business',
       'loan_category_vacation',
       'loan_category_wedding',
       'is_36months_0',
       'income_over_100000_0',
       'income_over_100000_1',
       'region_Midwest',
       'region_Northeast',
       'region_Other',
       'region_South',
       'region_West',
       'has_delinquency_0',
       'has_delinquency_1',
       'has_inquiry_0',
       'has_inquiry_1',
       'derogatory_public_record_0',
       'derogatory_public_record_1']
In [316]:
test_data_df = test_data_df[col]
In [317]:
test_data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 57 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   loan_requested                    80000 non-null  float64
 1   annual_income                     80000 non-null  float64
 2   debt_payment_ratio                80000 non-null  float64
 3   derogatory_public_record_count    80000 non-null  int64  
 4   total_credit                      80000 non-null  int64  
 5   credit_utilization_rate           80000 non-null  float64
 6   total_credit_line_count           80000 non-null  int64  
 7   issued_month                      80000 non-null  int64  
 8   issued_year                       80000 non-null  int64  
 9   log_open_credit_line              80000 non-null  float64
 10  YE_categories_G1                  80000 non-null  uint8  
 11  YE_categories_G2                  80000 non-null  uint8  
 12  YE_categories_G3                  80000 non-null  uint8  
 13  home_ownership_MORTGAGE           80000 non-null  uint8  
 14  home_ownership_OTHER              80000 non-null  int64  
 15  home_ownership_OWN                80000 non-null  uint8  
 16  home_ownership_RENT               80000 non-null  uint8  
 17  loan_grade_A                      80000 non-null  uint8  
 18  loan_grade_B                      80000 non-null  uint8  
 19  loan_grade_C                      80000 non-null  uint8  
 20  loan_grade_D                      80000 non-null  uint8  
 21  loan_grade_E                      80000 non-null  uint8  
 22  loan_grade_F                      80000 non-null  uint8  
 23  loan_grade_G                      80000 non-null  uint8  
 24  loan_grade_Other                  80000 non-null  int64  
 25  income_verification_not verified  80000 non-null  uint8  
 26  income_verification_verified      80000 non-null  uint8  
 27  initial_status_f                  80000 non-null  uint8  
 28  initial_status_w                  80000 non-null  uint8  
 29  loan_category_car                 80000 non-null  uint8  
 30  loan_category_credit_card         80000 non-null  uint8  
 31  loan_category_debt_consolidation  80000 non-null  uint8  
 32  loan_category_educational         80000 non-null  int64  
 33  loan_category_home_improvement    80000 non-null  uint8  
 34  loan_category_house               80000 non-null  uint8  
 35  loan_category_major_purchase      80000 non-null  uint8  
 36  loan_category_medical             80000 non-null  uint8  
 37  loan_category_moving              80000 non-null  uint8  
 38  loan_category_other               80000 non-null  uint8  
 39  loan_category_renewable_energy    80000 non-null  uint8  
 40  loan_category_small_business      80000 non-null  uint8  
 41  loan_category_vacation            80000 non-null  uint8  
 42  loan_category_wedding             80000 non-null  uint8  
 43  is_36months_0                     80000 non-null  uint8  
 44  income_over_100000_0              80000 non-null  uint8  
 45  income_over_100000_1              80000 non-null  uint8  
 46  region_Midwest                    80000 non-null  uint8  
 47  region_Northeast                  80000 non-null  uint8  
 48  region_Other                      80000 non-null  uint8  
 49  region_South                      80000 non-null  uint8  
 50  region_West                       80000 non-null  uint8  
 51  has_delinquency_0                 80000 non-null  uint8  
 52  has_delinquency_1                 80000 non-null  uint8  
 53  has_inquiry_0                     80000 non-null  uint8  
 54  has_inquiry_1                     80000 non-null  uint8  
 55  derogatory_public_record_0        80000 non-null  uint8  
 56  derogatory_public_record_1        80000 non-null  uint8  
dtypes: float64(5), int64(8), uint8(44)
memory usage: 11.3 MB
In [318]:
test_data_df.head()
Out[318]:
loan_requested annual_income debt_payment_ratio derogatory_public_record_count total_credit credit_utilization_rate total_credit_line_count issued_month issued_year log_open_credit_line YE_categories_G1 YE_categories_G2 YE_categories_G3 home_ownership_MORTGAGE home_ownership_OTHER home_ownership_OWN home_ownership_RENT loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G loan_grade_Other income_verification_not verified income_verification_verified initial_status_f initial_status_w loan_category_car loan_category_credit_card loan_category_debt_consolidation loan_category_educational loan_category_home_improvement loan_category_house loan_category_major_purchase loan_category_medical loan_category_moving loan_category_other loan_category_renewable_energy loan_category_small_business loan_category_vacation loan_category_wedding is_36months_0 income_over_100000_0 income_over_100000_1 region_Midwest region_Northeast region_Other region_South region_West has_delinquency_0 has_delinquency_1 has_inquiry_0 has_inquiry_1 derogatory_public_record_0 derogatory_public_record_1
0 6000.0 68000.0 28.31 0 19861 64.5 33 3 2015 1.255273 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 1 1 0
1 24000.0 110480.0 16.03 0 17001 26.2 36 3 2015 1.079181 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1 1 0
2 35000.0 86000.0 32.49 0 25797 49.9 33 3 2015 1.204120 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0
3 10000.0 30000.0 32.96 1 9586 43.8 21 3 2015 1.113943 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1
4 24000.0 82500.0 31.03 0 31842 41.3 43 3 2015 1.431364 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 0
In [319]:
loan_data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 317177 entries, 0 to 399999
Data columns (total 58 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   interest_rate                     317177 non-null  float64
 1   loan_requested                    317177 non-null  float64
 2   annual_income                     317177 non-null  float64
 3   debt_payment_ratio                317177 non-null  float64
 4   derogatory_public_record_count    317177 non-null  float64
 5   total_credit                      317177 non-null  float64
 6   credit_utilization_rate           317177 non-null  float64
 7   total_credit_line_count           317177 non-null  float64
 8   issued_month                      317177 non-null  int64  
 9   issued_year                       317177 non-null  int64  
 10  log_open_credit_line              317177 non-null  float64
 11  YE_categories_G1                  317177 non-null  uint8  
 12  YE_categories_G2                  317177 non-null  uint8  
 13  YE_categories_G3                  317177 non-null  uint8  
 14  home_ownership_MORTGAGE           317177 non-null  uint8  
 15  home_ownership_OTHER              317177 non-null  uint8  
 16  home_ownership_OWN                317177 non-null  uint8  
 17  home_ownership_RENT               317177 non-null  uint8  
 18  loan_grade_A                      317177 non-null  uint8  
 19  loan_grade_B                      317177 non-null  uint8  
 20  loan_grade_C                      317177 non-null  uint8  
 21  loan_grade_D                      317177 non-null  uint8  
 22  loan_grade_E                      317177 non-null  uint8  
 23  loan_grade_F                      317177 non-null  uint8  
 24  loan_grade_G                      317177 non-null  uint8  
 25  loan_grade_Other                  317177 non-null  uint8  
 26  income_verification_not verified  317177 non-null  uint8  
 27  income_verification_verified      317177 non-null  uint8  
 28  initial_status_f                  317177 non-null  uint8  
 29  initial_status_w                  317177 non-null  uint8  
 30  loan_category_car                 317177 non-null  uint8  
 31  loan_category_credit_card         317177 non-null  uint8  
 32  loan_category_debt_consolidation  317177 non-null  uint8  
 33  loan_category_educational         317177 non-null  uint8  
 34  loan_category_home_improvement    317177 non-null  uint8  
 35  loan_category_house               317177 non-null  uint8  
 36  loan_category_major_purchase      317177 non-null  uint8  
 37  loan_category_medical             317177 non-null  uint8  
 38  loan_category_moving              317177 non-null  uint8  
 39  loan_category_other               317177 non-null  uint8  
 40  loan_category_renewable_energy    317177 non-null  uint8  
 41  loan_category_small_business      317177 non-null  uint8  
 42  loan_category_vacation            317177 non-null  uint8  
 43  loan_category_wedding             317177 non-null  uint8  
 44  is_36months_0                     317177 non-null  uint8  
 45  income_over_100000_0              317177 non-null  uint8  
 46  income_over_100000_1              317177 non-null  uint8  
 47  region_Midwest                    317177 non-null  uint8  
 48  region_Northeast                  317177 non-null  uint8  
 49  region_Other                      317177 non-null  uint8  
 50  region_South                      317177 non-null  uint8  
 51  region_West                       317177 non-null  uint8  
 52  has_delinquency_0                 317177 non-null  uint8  
 53  has_delinquency_1                 317177 non-null  uint8  
 54  has_inquiry_0                     317177 non-null  uint8  
 55  has_inquiry_1                     317177 non-null  uint8  
 56  derogatory_public_record_0        317177 non-null  uint8  
 57  derogatory_public_record_1        317177 non-null  uint8  
dtypes: float64(9), int64(2), uint8(47)
memory usage: 43.3 MB
Apply Linear Regression model¶
In [321]:
lr_model = LinearRegression()

lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(test_data_df)
Apply Random Forest¶
In [322]:
rf = RandomForestRegressor(random_state = 42)

# Fit 'rf' to the training set
rf.fit(X_train, y_train)


# Predict the test set labels 'y_pred_test'
y_pred_test = rf.predict(test_data_df)
Save the prediction result to csv file¶
  • Linear Regression
In [323]:
test = pd.read_csv('Holdout for Testing.csv')
predict = pd.DataFrame({
    "Loan_Id": test["X2"], 
    "Predicted_Interest_Rate": y_pred_lr
})
predict.to_csv('Loan_Results_from_Maggie_Chuang_LR2.csv', index=False)
  • Random Forest
In [324]:
test = pd.read_csv('Holdout for Testing.csv')
predict = pd.DataFrame({
    "Loan_Id": test["X2"],
    "Predicted_Interest_Rate": y_pred_test
})
predict.to_csv('Loan_Results_from_Maggie_Chuang_RF2.csv', index=False)